Final Project

Data Science Job Listings Analysis

Data Mining Principles

Zijia Cao | Gina Champion | Oluwafemi Fabiyi | Kaihao Fan | Ignas Grabauskas | Nicolas Yanez

In this Colab notebook we mine and explore three datasets of job listings scraped from Glassdoor, applying natural language processing (NLP) techniques along the way. The datasets are available on Kaggle at the following links:

  • Data Scientist Job Postings (n = 3900) (link)

  • Data Analyst Job Postings (n = 2000) (link)

  • Business Analyst Job Postings (n = 4000) (link)

The three datasets share a similar structure; the business analyst dataset contains some additional features that will be excluded from the final analysis.
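Dropping the extra business-analyst columns can be done by keeping only the columns common to all three tables. A minimal sketch (the column names here are hypothetical stand-ins, not the real schema, which is loaded later in the notebook):

```python
import pandas as pd

# Toy stand-ins for the three scraped datasets (hypothetical columns).
ds = pd.DataFrame(columns=["Job Title", "Salary Estimate", "Job Description"])
da = pd.DataFrame(columns=["Job Title", "Salary Estimate", "Job Description"])
ba = pd.DataFrame(columns=["Job Title", "Salary Estimate", "Job Description",
                           "Seniority", "Employment Type"])  # extra features

# Keep only the columns shared by all three frames, in ds's column order.
common = [c for c in ds.columns if c in da.columns and c in ba.columns]
ba_aligned = ba[common]

print(common)  # → ['Job Title', 'Salary Estimate', 'Job Description']
```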

The task is to explore the datasets and extract insights about the differences and similarities between job listings in three highly sought-after fields in the data science job market.
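One simple way to surface such differences is to compare term frequencies across the corpora. A pure-Python sketch on toy descriptions (the notebook itself uses TF-IDF and topic models for the real analysis):

```python
import re
from collections import Counter

def term_counts(docs):
    """Lowercase, tokenize on letter runs, and count terms."""
    tokens = [t for d in docs for t in re.findall(r"[a-z]+", d.lower())]
    return Counter(tokens)

# Toy one-document corpora standing in for the real job descriptions.
ds_docs = ["Build machine learning models in Python"]
ba_docs = ["Gather business requirements and build Excel reports"]

ds_counts = term_counts(ds_docs)
ba_counts = term_counts(ba_docs)

# Terms that appear in data scientist listings but not business analyst ones.
ds_only = set(ds_counts) - set(ba_counts)
print(sorted(ds_only))  # → ['in', 'learning', 'machine', 'models', 'python']
```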

Installing Libraries

In [ ]:
import os

# Install Java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install PySpark
! pip install --ignore-installed -q pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed -q spark-nlp

# Install Top2Vec with the sentence-encoder extras
! pip install "top2vec[sentence_encoders]"

# Install BERTopic with the visualization extras
! pip install "bertopic[visualization]"
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~18.04-b08)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)
[verbose pip dependency-resolution log truncated; key lines kept below]
  Building wheel for pyspark (setup.py) ... done
Successfully built hdbscan
ERROR: tensorflow 2.4.1 has requirement numpy~=1.19.2, but you'll have numpy 1.20.1 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
ERROR: albumentations 0.1.12 has requirement imgaug<0.2.7,>=0.2.5, but you'll have imgaug 0.2.9 which is incompatible.
Successfully installed hdbscan-0.8.27 numpy-1.20.1 top2vec-1.0.23
Successfully installed tensorflow-text-2.4.3
Successfully installed bertopic-0.6.0 sacremoses-0.0.43 sentence-transformers-0.4.1.2 sentencepiece-0.1.95 tokenizers-0.10.1 transformers-4.4.1
Successfully installed plotly-4.14.2

Importing Libraries

In [ ]:
from google.colab import drive

import pandas as pd
import numpy as np
import re
import json
import copy
import pickle
from scipy.cluster.hierarchy import linkage, dendrogram
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from top2vec import Top2Vec
from bertopic import BERTopic

# importing visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# importing scikitlearn modules
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity
from sklearn.decomposition import TruncatedSVD, NMF, PCA, LatentDirichletAllocation
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans, DBSCAN
from sklearn.manifold import TSNE
from sklearn.feature_extraction import text
from sklearn.preprocessing import normalize, LabelEncoder

# importing PySpark (note: pyspark's Pipeline and CountVectorizer shadow
# the scikit-learn classes of the same name imported above)
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# importing Spark NLP 
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.embeddings import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp
In [ ]:
nltk.download('stopwords')
nltk.download('wordnet')  # required by WordNetLemmatizer
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Out[ ]:
True

Data Paths

In [ ]:
root_path = '/content/drive/Shareddrives/Data Mining Group Project/'
output_path = '/content/drive/Shareddrives/Data Mining Group Project/presentation/'
In [ ]:
# drive.flush_and_unmount()
drive.mount('/content/drive')
In [ ]:
# for GPU training >> sparknlp.start(gpu=True)
spark = sparknlp.start(gpu = False)

print("Apache Spark version:", spark.version)
print("Spark NLP version", sparknlp.version())
Apache Spark version: 2.4.4
Spark NLP version 2.7.5
In [ ]:
# reading in data for each job type
da_data = pd.read_csv(root_path + 'data/DataAnalyst.csv')
ds_data = pd.read_csv(root_path + 'data/DataScientist.csv')
ba_data = pd.read_csv(root_path + 'data/BusinessAnalyst.csv')
In [ ]:
print(da_data.info())
da_data.head(3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2253 entries, 0 to 2252
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         2253 non-null   int64  
 1   Job Title          2253 non-null   object 
 2   Salary Estimate    2253 non-null   object 
 3   Job Description    2253 non-null   object 
 4   Rating             2253 non-null   float64
 5   Company Name       2252 non-null   object 
 6   Location           2253 non-null   object 
 7   Headquarters       2253 non-null   object 
 8   Size               2253 non-null   object 
 9   Founded            2253 non-null   int64  
 10  Type of ownership  2253 non-null   object 
 11  Industry           2253 non-null   object 
 12  Sector             2253 non-null   object 
 13  Revenue            2253 non-null   object 
 14  Competitors        2253 non-null   object 
 15  Easy Apply         2253 non-null   object 
dtypes: float64(1), int64(2), object(13)
memory usage: 281.8+ KB
None
Out[ ]:
Unnamed: 0 Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Founded Type of ownership Industry Sector Revenue Competitors Easy Apply
0 0 Data Analyst, Center on Immigration and Justic... $37K-$66K (Glassdoor est.) Are you eager to roll up your sleeves and harn... 3.2 Vera Institute of Justice\n3.2 New York, NY New York, NY 201 to 500 employees 1961 Nonprofit Organization Social Assistance Non-Profit $100 to $500 million (USD) -1 True
1 1 Quality Data Analyst $37K-$66K (Glassdoor est.) Overview\n\nProvides analytical and technical ... 3.8 Visiting Nurse Service of New York\n3.8 New York, NY New York, NY 10000+ employees 1893 Nonprofit Organization Health Care Services & Hospitals Health Care $2 to $5 billion (USD) -1 -1
2 2 Senior Data Analyst, Insights & Analytics Team... $37K-$66K (Glassdoor est.) We’re looking for a Senior Data Analyst who ha... 3.4 Squarespace\n3.4 New York, NY New York, NY 1001 to 5000 employees 2003 Company - Private Internet Information Technology Unknown / Non-Applicable GoDaddy -1
In [ ]:
print(ds_data.info())
ds_data.head(3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3909 entries, 0 to 3908
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         3909 non-null   int64  
 1   index              3909 non-null   int64  
 2   Job Title          3909 non-null   object 
 3   Salary Estimate    3909 non-null   object 
 4   Job Description    3909 non-null   object 
 5   Rating             3909 non-null   float64
 6   Company Name       3909 non-null   object 
 7   Location           3909 non-null   object 
 8   Headquarters       3909 non-null   object 
 9   Size               3909 non-null   object 
 10  Founded            3909 non-null   int64  
 11  Type of ownership  3909 non-null   object 
 12  Industry           3909 non-null   object 
 13  Sector             3909 non-null   object 
 14  Revenue            3909 non-null   object 
 15  Competitors        3909 non-null   object 
 16  Easy Apply         3909 non-null   object 
dtypes: float64(1), int64(3), object(13)
memory usage: 519.3+ KB
None
Out[ ]:
Unnamed: 0 index Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Founded Type of ownership Industry Sector Revenue Competitors Easy Apply
0 0 0 Senior Data Scientist $111K-$181K (Glassdoor est.) ABOUT HOPPER\n\nAt Hopper, we’re on a mission ... 3.5 Hopper\n3.5 New York, NY Montreal, Canada 501 to 1000 employees 2007 Company - Private Travel Agencies Travel & Tourism Unknown / Non-Applicable -1 -1
1 1 1 Data Scientist, Product Analytics $111K-$181K (Glassdoor est.) At Noom, we use scientifically proven methods ... 4.5 Noom US\n4.5 New York, NY New York, NY 1001 to 5000 employees 2008 Company - Private Health, Beauty, & Fitness Consumer Services Unknown / Non-Applicable -1 -1
2 2 2 Data Science Manager $111K-$181K (Glassdoor est.) Decode_M\n\nhttps://www.decode-m.com/\n\nData ... -1.0 Decode_M New York, NY New York, NY 1 to 50 employees -1 Unknown -1 -1 Unknown / Non-Applicable -1 True
In [ ]:
print(ba_data.info())
ba_data.head(3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4092 entries, 0 to 4091
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Unnamed: 0         4092 non-null   object
 1   index              4092 non-null   object
 2   Job Title          4092 non-null   object
 3   Salary Estimate    4092 non-null   object
 4   Job Description    4092 non-null   object
 5   Rating             4092 non-null   object
 6   Company Name       4092 non-null   object
 7   Location           4092 non-null   object
 8   Headquarters       4092 non-null   object
 9   Size               4092 non-null   object
 10  Founded            4092 non-null   object
 11  Type of ownership  4092 non-null   object
 12  Industry           4092 non-null   object
 13  Sector             4092 non-null   object
 14  Revenue            4092 non-null   object
 15  Competitors        3692 non-null   object
 16  Easy Apply         3692 non-null   object
dtypes: object(17)
memory usage: 543.6+ KB
None
Out[ ]:
Unnamed: 0 index Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Founded Type of ownership Industry Sector Revenue Competitors Easy Apply
0 0 0 Business Analyst - Clinical & Logistics Platform $56K-$102K (Glassdoor est.) Company Overview\n\n\nAt Memorial Sloan Ketter... 3.9 Memorial Sloan-Kettering\n3.9 New York, NY New York, NY 10000+ employees 1884 Nonprofit Organization Health Care Services & Hospitals Health Care $2 to $5 billion (USD) Mayo Clinic, The Johns Hopkins Hospital, MD An... -1
1 1 1 Business Analyst $56K-$102K (Glassdoor est.) We are seeking for an energetic and collaborat... 3.8 Paine Schwartz Partners\n3.8 New York, NY New York, NY 1 to 50 employees -1 Company - Private Venture Capital & Private Equity Finance Unknown / Non-Applicable -1 True
2 2 2 Data Analyst $56K-$102K (Glassdoor est.) For more than a decade, Asembia has been worki... 3.6 Asembia\n3.6 Florham Park, NJ Florham Park, NJ 501 to 1000 employees 2004 Company - Private Biotech & Pharmaceuticals Biotech & Pharmaceuticals $5 to $10 million (USD) -1 -1

Data Cleaning

Addressing Extra Columns

In [ ]:
# the first two columns of the business analyst dataset are type object,
# whereas in the other datasets they are type int, suggesting there might
# be text values. searching for rows with text shows that some rows of the
# business analyst data are shifted left by 2 (last two columns are NaNs)
ba_data[ba_data['Unnamed: 0'].str.contains('[a-zA-Z]', regex=True)].tail(3)
Out[ ]:
Unnamed: 0 index Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Founded Type of ownership Industry Sector Revenue Competitors Easy Apply
4089 Programmer Analyst- PeopleSoft ( Finance and S... $66K-$114K (Glassdoor est.) Job Opening Summary\nReports to the Systems Ap... 4.0 Shands at the University of Florida\n4.0 Jacksonville, FL Gainesville, FL 10000+ employees -1 Subsidiary or Business Segment Health Care Services & Hospitals Health Care $1 to $2 billion (USD) Mount Sinai Medical Center of Florida, Baptist... -1 NaN NaN
4090 Loss Mitigation Analyst $66K-$114K (Glassdoor est.) Job Description\nA knowledgeable job-seeker is... 4.4 Contemporary Staffing Solutions\n4.4 Jacksonville, FL Mount Laurel, NJ 1001 to 5000 employees 1994 Company - Private Staffing & Outsourcing Business Services $100 to $500 million (USD) PathFinder Staffing, Juno Search Partners, Rob... -1 NaN NaN
4091 Financial Analyst II - Baptist $66K-$114K (Glassdoor est.) Job Summary\n\nThis unique analyst position re... 2.7 Baptist Medical Center Jacksonville\n2.7 Jacksonville, FL Jacksonville, FL 5001 to 10000 employees -1 Hospital Health Care Services & Hospitals Health Care $1 to $2 billion (USD) -1 -1 NaN NaN
In [ ]:
# finding indices of rows that need to be shifted back and correcting
shifted_rows = ba_data.index[ba_data['Unnamed: 0'].str.contains('[a-zA-Z]+')]
ba_data.iloc[shifted_rows, 2:] = ba_data.iloc[shifted_rows, 0:-2].values

# dropping beginning columns after correction
ba_data = ba_data.iloc[:, 2:]

# the leading index columns of the other two datasets can be dropped as well
da_data = da_data.iloc[:, 1:]
ds_data = ds_data.iloc[:, 2:]

# printing out columns to inspect alignment
# notice that business analyst data types differ due to originally 
# shifted data
print(da_data.info())
print('\n')
print(ds_data.info())
print('\n')
print(ba_data.info())

# making sure shift worked
ba_data.tail(3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2253 entries, 0 to 2252
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Job Title          2253 non-null   object 
 1   Salary Estimate    2253 non-null   object 
 2   Job Description    2253 non-null   object 
 3   Rating             2253 non-null   float64
 4   Company Name       2252 non-null   object 
 5   Location           2253 non-null   object 
 6   Headquarters       2253 non-null   object 
 7   Size               2253 non-null   object 
 8   Founded            2253 non-null   int64  
 9   Type of ownership  2253 non-null   object 
 10  Industry           2253 non-null   object 
 11  Sector             2253 non-null   object 
 12  Revenue            2253 non-null   object 
 13  Competitors        2253 non-null   object 
 14  Easy Apply         2253 non-null   object 
dtypes: float64(1), int64(1), object(13)
memory usage: 264.1+ KB
None


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3909 entries, 0 to 3908
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Job Title          3909 non-null   object 
 1   Salary Estimate    3909 non-null   object 
 2   Job Description    3909 non-null   object 
 3   Rating             3909 non-null   float64
 4   Company Name       3909 non-null   object 
 5   Location           3909 non-null   object 
 6   Headquarters       3909 non-null   object 
 7   Size               3909 non-null   object 
 8   Founded            3909 non-null   int64  
 9   Type of ownership  3909 non-null   object 
 10  Industry           3909 non-null   object 
 11  Sector             3909 non-null   object 
 12  Revenue            3909 non-null   object 
 13  Competitors        3909 non-null   object 
 14  Easy Apply         3909 non-null   object 
dtypes: float64(1), int64(1), object(13)
memory usage: 458.2+ KB
None


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4092 entries, 0 to 4091
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Job Title          4092 non-null   object
 1   Salary Estimate    4092 non-null   object
 2   Job Description    4092 non-null   object
 3   Rating             4092 non-null   object
 4   Company Name       4092 non-null   object
 5   Location           4092 non-null   object
 6   Headquarters       4092 non-null   object
 7   Size               4092 non-null   object
 8   Founded            4092 non-null   object
 9   Type of ownership  4092 non-null   object
 10  Industry           4092 non-null   object
 11  Sector             4092 non-null   object
 12  Revenue            4092 non-null   object
 13  Competitors        4092 non-null   object
 14  Easy Apply         4092 non-null   object
dtypes: object(15)
memory usage: 479.7+ KB
None
Out[ ]:
Job Title Salary Estimate Job Description Rating Company Name Location Headquarters Size Founded Type of ownership Industry Sector Revenue Competitors Easy Apply
4089 Programmer Analyst- PeopleSoft ( Finance and S... $66K-$114K (Glassdoor est.) Job Opening Summary\nReports to the Systems Ap... 4.0 Shands at the University of Florida\n4.0 Jacksonville, FL Gainesville, FL 10000+ employees -1 Subsidiary or Business Segment Health Care Services & Hospitals Health Care $1 to $2 billion (USD) Mount Sinai Medical Center of Florida, Baptist... -1
4090 Loss Mitigation Analyst $66K-$114K (Glassdoor est.) Job Description\nA knowledgeable job-seeker is... 4.4 Contemporary Staffing Solutions\n4.4 Jacksonville, FL Mount Laurel, NJ 1001 to 5000 employees 1994 Company - Private Staffing & Outsourcing Business Services $100 to $500 million (USD) PathFinder Staffing, Juno Search Partners, Rob... -1
4091 Financial Analyst II - Baptist $66K-$114K (Glassdoor est.) Job Summary\n\nThis unique analyst position re... 2.7 Baptist Medical Center Jacksonville\n2.7 Jacksonville, FL Jacksonville, FL 5001 to 10000 employees -1 Hospital Health Care Services & Hospitals Health Care $1 to $2 billion (USD) -1 -1
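The slice-assignment fix above can be illustrated on a plain Python list; the row below borrows the first few values of the 'Loss Mitigation Analyst' row shown in the output, with the remaining fields abbreviated:

```python
# a misaligned row: values that belong in columns 2..N sit in columns
# 0..N-2, so the last two slots are empty (NaN in the real data)
shifted_row = ['Loss Mitigation Analyst', '$66K-$114K (Glassdoor est.)',
               'Job Description ...', '4.4', 'Jacksonville, FL', None, None]

# equivalent of: ba_data.iloc[shifted_rows, 2:] = ba_data.iloc[shifted_rows, 0:-2].values
corrected_row = [None, None] + shifted_row[:-2]

print(corrected_row[2])  # Loss Mitigation Analyst
```

After the shift, dropping the first two (index) columns restores alignment with the other two datasets.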

Combining Datasets

In [ ]:
# creating a new column in each dataset indicating job type
da_data['Job Type'] = 'Data Analyst'
ba_data['Job Type'] = 'Business Analyst'
ds_data['Job Type'] = 'Data Scientist'

# the business analyst dataset has all columns as type object, which would
# cause the combined dataframe to cast every column to object
# we want appropriate data types, so we build a dtype dictionary from the
# data analyst dtypes (the data scientist dtypes would work as well)
type_dict = dict(zip(da_data.columns, da_data.dtypes.tolist()))

# combining all three datasets into one to continue data cleanup
# as well as converting dtypes after concatenation
data = pd.concat([da_data, ba_data, ds_data], ignore_index=True).astype(type_dict)

# cleaning up column names so data is easier to work with
data.columns = da_data.columns.str.title().str.replace(' ', '')

# displaying
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10254 entries, 0 to 10253
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   JobTitle         10254 non-null  object 
 1   SalaryEstimate   10254 non-null  object 
 2   JobDescription   10254 non-null  object 
 3   Rating           10254 non-null  float64
 4   CompanyName      10253 non-null  object 
 5   Location         10254 non-null  object 
 6   Headquarters     10254 non-null  object 
 7   Size             10254 non-null  object 
 8   Founded          10254 non-null  int64  
 9   TypeOfOwnership  10254 non-null  object 
 10  Industry         10254 non-null  object 
 11  Sector           10254 non-null  object 
 12  Revenue          10254 non-null  object 
 13  Competitors      10254 non-null  object 
 14  EasyApply        10254 non-null  object 
 15  JobType          10254 non-null  object 
dtypes: float64(1), int64(1), object(14)
memory usage: 1.3+ MB
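Per column name, the `.str.title().str.replace(' ', '')` chain used above behaves like this small standalone sketch:

```python
# a few of the raw column names from the datasets
raw_cols = ['Job Title', 'Type of ownership', 'Easy Apply']

# title-case each word, then drop the spaces to get CamelCase names
clean_cols = [c.title().replace(' ', '') for c in raw_cols]
print(clean_cols)  # ['JobTitle', 'TypeOfOwnership', 'EasyApply']
```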

Imputing Founded Year

In [ ]:
# we will use kernel density estimation (KDE) to impute missing values
# we want the distribution of company founded years, so we first
# drop missing values (-1) so they don't skew the distribution
training_years = data[data['Founded'] != -1]['Founded'].to_numpy().reshape(-1, 1)

# number of missing values and their indexes
missing_idx = data[data['Founded'] == -1].index
n_missing = len(missing_idx)

# # using grid search to find the optimal bandwidth for kde
# # to skip the wait, use bandwidth = 1.85
# bandwidth = np.linspace(0.01, 2.00, 50)
kde = KernelDensity(kernel='gaussian', bandwidth=1.85)
kde.fit(training_years) # comment this line out when doing grid search
# grid = GridSearchCV(kde, {'bandwidth': bandwidth})
# grid.fit(training_years)

# # updating kde object to now use the estimator determined through grid search
# kde = grid.best_estimator_

# using optimized kde to generate random samples
kde_sample = kde.sample(n_samples=n_missing, random_state=1)

# function to scale values into a predefined range
# needed because KDE can produce values outside the observed range
# (for example, a founding year of 2023)
def rescale(value, s_min, s_max, t_min, t_max):
  return (value - s_min) / (s_max - s_min) * (t_max - t_min) + t_min
  
# rescaling generated values to make sure nothing falls outside of existing range
kde_sample_rescaled = np.array([round(rescale(year, kde_sample.min(), kde_sample.max(), training_years.min(), training_years.max())) for year in kde_sample.flatten()])

# replacing missing values with the random data points
data.loc[missing_idx, 'Founded'] = kde_sample_rescaled

# converting founded year to company age
# and then dropping original column
data['OrganizationAge'] = 2021 - data['Founded']
data = data.drop(columns=['Founded'])
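As a standalone sanity check of the `rescale` helper defined above (the year ranges here are illustrative, not taken from the data):

```python
def rescale(value, s_min, s_max, t_min, t_max):
    # map value linearly from [s_min, s_max] onto [t_min, t_max]
    return (value - s_min) / (s_max - s_min) * (t_max - t_min) + t_min

# endpoints of the source range map to endpoints of the target range
print(rescale(0, 0, 10, 1800, 2020))   # 1800.0
print(rescale(10, 0, 10, 1800, 2020))  # 2020.0
# and the midpoint maps to the midpoint
print(rescale(5, 0, 10, 1800, 2020))   # 1910.0
```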

Imputing Rating

In [ ]:
# we want the distribution of ratings, so we first
# drop missing values (-1) so they don't skew the distribution
training_ratings = data[data['Rating'] != -1]['Rating'].to_numpy().reshape(-1, 1)

# number of missing values and their indexes
missing_idx = data[data['Rating'] == -1].index
n_missing = len(missing_idx)

# # using grid search to find the optimal bandwidth for kde
# # to skip the wait, use bandwidth = 0.01
# bandwidth = np.linspace(0.01, 2.00, 50)
kde = KernelDensity(kernel='gaussian', bandwidth=0.01)
kde.fit(training_ratings) 
# grid = GridSearchCV(kde, {'bandwidth': bandwidth})
# grid.fit(training_ratings)

# # updating kde object to now use the estimator determined through grid search
# kde = grid.best_estimator_

# using optimized kde to generate random samples
kde_sample = kde.sample(n_samples=n_missing, random_state=1)
  
# rescaling generated values to make sure nothing falls outside of existing range
# using the same function as defined previously
kde_sample_rescaled = np.array([round(rescale(rating, kde_sample.min(), kde_sample.max(), training_ratings.min(), training_ratings.max()), 1) for rating in kde_sample.flatten()])

# replacing missing values with the random data points
data.loc[missing_idx, 'Rating'] = kde_sample_rescaled
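Conceptually, drawing a sample from a Gaussian KDE amounts to picking an observed point uniformly at random and adding Gaussian noise with standard deviation equal to the bandwidth. A stdlib-only sketch of that idea (the ratings below are illustrative, not the real data):

```python
import random

def kde_draw(observed, bandwidth, n, seed=1):
    # draw n samples from a Gaussian KDE fit on `observed`:
    # choose a kernel center at random, then add N(0, bandwidth) noise
    rng = random.Random(seed)
    return [rng.choice(observed) + rng.gauss(0, bandwidth) for _ in range(n)]

observed_ratings = [3.2, 3.8, 3.4, 4.5, 2.7]
samples = kde_draw(observed_ratings, bandwidth=0.01, n=3)

# with a tiny bandwidth every draw lands very close to an observed rating
print(all(min(abs(s - o) for o in observed_ratings) < 0.1 for s in samples))  # True
```

This is also why a very small bandwidth such as 0.01 essentially resamples the observed rating distribution.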

Miscellaneous

In [ ]:
# cleaning up job descriptions by replacing newline characters with spaces
data['JobDescription'] = data['JobDescription'].str.replace('\n', ' ', regex=True)

# the company name column sometimes has the company rating appended
# after a newline; we want to remove these stragglers
data['CompanyName'] = data['CompanyName'].str.replace(r'\n\d\.\d$', '', regex=True)

# dropping columns that aren't needed for analysis
# revenue data has a majority of missing values
data = data.drop(columns=['EasyApply', 'Competitors', 'Revenue'])
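The trailing-rating pattern used on `CompanyName` can be checked in isolation on one of the names seen in the raw data:

```python
import re

# 'Squarespace\n3.4' appears in the raw data: the rating is appended
# to the company name after a newline
name = 'Squarespace\n3.4'
cleaned = re.sub(r'\n\d\.\d$', '', name)
print(cleaned)  # Squarespace
```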

Feature Engineering

Extracting Salary Information

In [ ]:
# salary column contains both salaried and hourly compensation info
hourly_mask = data['SalaryEstimate'].str.contains('Per Hour')

# we can see salary values are strings from which we can extract float values
print('Example of hourly syntax:', data[hourly_mask].iloc[0, 1])
print('Example of salary syntax:', data[~hourly_mask].iloc[0, 1])
Example of hourly syntax: $34-$53 Per Hour(Glassdoor est.)
Example of salary syntax: $37K-$66K (Glassdoor est.)
In [ ]:
# using a regex pattern with capture groups to extract the numbers:
# the first group captures the lower bound, the second the upper bound
salary_pattern = r'\$*(\d+)[kK]*-\$*(\d+)[kK]*.*'

# creating a two-column dataframe, one column per capture group:
# one serves as the lower salary bound, the other as the upper bound
salary_df = (data['SalaryEstimate']
             .str.replace(r'\s', '', regex=True)
             .str.extract(salary_pattern)
             .rename(columns={0: 'SalaryLower', 1: 'SalaryUpper'})
             .astype('float')
            )

# displaying df
salary_df.head(3)
Out[ ]:
SalaryLower SalaryUpper
0 37.0 66.0
1 37.0 66.0
2 37.0 66.0
In [ ]:
# currently, salary bounds are in units of either $/hr or $K/year and
# we want to convert both to annual rates:
# hourly rates should be multiplied by work hours in a year (40 * 52 = 2080)
# salaried rates should be multiplied by 1,000
salary_df.loc[hourly_mask, :] = salary_df.loc[hourly_mask, :].values * 2_080
salary_df.loc[~hourly_mask, :] = salary_df.loc[~hourly_mask, :].values * 1_000
salary_df['SalaryAvg'] = (salary_df['SalaryLower'] + salary_df['SalaryUpper']) // 2

# combining the salary column with the existing dataset
# and dropping original columns
data = pd.concat([data, salary_df], axis=1).drop(columns=['SalaryEstimate'])

# displaying resulting df
data.head(3)
Out[ ]:
JobTitle JobDescription Rating CompanyName Location Headquarters Size TypeOfOwnership Industry Sector JobType OrganizationAge SalaryLower SalaryUpper SalaryAvg
0 Data Analyst, Center on Immigration and Justic... Are you eager to roll up your sleeves and harn... 3.2 Vera Institute of Justice New York, NY New York, NY 201 to 500 employees Nonprofit Organization Social Assistance Non-Profit Data Analyst 60 37000.0 66000.0 51500.0
1 Quality Data Analyst Overview Provides analytical and technical su... 3.8 Visiting Nurse Service of New York New York, NY New York, NY 10000+ employees Nonprofit Organization Health Care Services & Hospitals Health Care Data Analyst 128 37000.0 66000.0 51500.0
2 Senior Data Analyst, Insights & Analytics Team... We’re looking for a Senior Data Analyst who ha... 3.4 Squarespace New York, NY New York, NY 1001 to 5000 employees Company - Private Internet Information Technology Data Analyst 18 37000.0 66000.0 51500.0
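The capture groups and unit conversions above can be verified on the two example strings printed earlier (whitespace already removed, as in the pipeline):

```python
import re

salary_pattern = r'\$*(\d+)[kK]*-\$*(\d+)[kK]*.*'

hourly = '$34-$53PerHour(Glassdoorest.)'
salaried = '$37K-$66K(Glassdoorest.)'

# hourly rates scale by 2080 work hours per year
lo, hi = (int(g) for g in re.match(salary_pattern, hourly).groups())
print(lo * 2_080, hi * 2_080)  # 70720 110240

# $K figures scale by 1000
lo, hi = (int(g) for g in re.match(salary_pattern, salaried).groups())
print(lo * 1_000, hi * 1_000)  # 37000 66000
```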

Categorical Columns

In [ ]:
# creating categorical variable column for type of ownership
# ownership type can be reduced to fewer categories
# value counts are omitted
ownership_map = {
    'Company - Private': 1,
    'Company - Public': 1,
    '-1': 1,
    'Nonprofit Organization': 0,
    'Subsidiary or Business Segment': 1,
    'Government': 0,
    'College / University': 0,
    'Unknown': 1,
    'Hospital': 0,
    'Contract': 1,
    'Other Organization': 0,
    'Private Practice / Firm': 1,
    'School / School District': 0,
    'Self-employed': 0,
    'Franchise': 1
}
data['IsBusiness'] = data['TypeOfOwnership'].map(ownership_map)

# dropping the column that won't be needed anymore
data = data.drop(columns=['TypeOfOwnership'])
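Note that `Series.map` with a dictionary behaves like `dict.get`: any label missing from the mapping comes back as a missing value, so every observed category must appear as a key. A minimal sketch (abbreviated map, with a label deliberately left out):

```python
# abbreviated version of the ownership map above
ownership_map = {'Company - Private': 1, 'Nonprofit Organization': 0}

labels = ['Company - Private', 'Hospital']
mapped = [ownership_map.get(label) for label in labels]
print(mapped)  # [1, None] -- 'Hospital' has no key, so it becomes missing
```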
In [ ]:
# creating categorical variable column for size of company
size_map = {
    '1 to 50 employees': 0,
    '51 to 200 employees': 1,
    '201 to 500 employees': 2,
    '501 to 1000 employees': 3,
    '1001 to 5000 employees': 4,
    '5001 to 10000 employees': 5,
    '10000+ employees': 6,
    -1: None,
    '-1': None,
    'Unknown': None
    }

# replacing current value columns with values as defined by map
data['Size'] = data['Size'].map(size_map)

# we want to impute missing values by sampling from the empirical
# distribution of the existing values
# we need the indices of the observations with missing values (to fill
# them) and their count to generate a random sample of that size
missing_idx = data[data['Size'].isna()].index
n_missing = len(missing_idx)

# generating random values based on the distribution of non-null values
# the distribution is just the relative frequency of each class
# it is important to sort the value counts by class value (0 - 6), not by
# count, so np.random.choice assigns each probability to the right class
size_counts = data[data['Size'].notna()]['Size'].value_counts(dropna=False).sort_index().values
size_freq = size_counts / size_counts.sum()
rand_sample = np.random.choice(range(0, 7), size=n_missing, p=size_freq)

# assigning random sizes to missing values
data.loc[missing_idx, 'Size'] = rand_sample
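The frequency-weighted sampling above can be sketched with the standard library alone; `np.random.choice(..., p=freq)` corresponds to `random.choices` with weights (the observed size categories below are illustrative, not the real counts):

```python
import random
from collections import Counter

# illustrative observed size categories (not the real counts)
observed = [0, 0, 1, 4, 4, 4, 6]

counts = Counter(observed)
classes = sorted(counts)                # sort by class value, as in the notebook
weights = [counts[c] for c in classes]  # raw counts work directly as weights

rng = random.Random(1)
imputed = rng.choices(classes, weights=weights, k=5)

# every imputed value is one of the observed classes
print(all(v in classes for v in imputed))  # True
```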
In [ ]:
data.head(3)
Out[ ]:
JobTitle JobDescription Rating CompanyName Location Headquarters Size Industry Sector JobType OrganizationAge SalaryLower SalaryUpper SalaryAvg IsBusiness
0 Data Analyst, Center on Immigration and Justic... Are you eager to roll up your sleeves and harn... 3.2 Vera Institute of Justice New York, NY New York, NY 2.0 Social Assistance Non-Profit Data Analyst 60 37000.0 66000.0 51500.0 0
1 Quality Data Analyst Overview Provides analytical and technical su... 3.8 Visiting Nurse Service of New York New York, NY New York, NY 6.0 Health Care Services & Hospitals Health Care Data Analyst 128 37000.0 66000.0 51500.0 0
2 Senior Data Analyst, Insights & Analytics Team... We’re looking for a Senior Data Analyst who ha... 3.4 Squarespace New York, NY New York, NY 4.0 Internet Information Technology Data Analyst 18 37000.0 66000.0 51500.0 1

Saving/Loading Clean Dataset

In [ ]:
# saving the cleaned dataset to the GDrive
data.to_csv(root_path + 'data/clean_data.csv', index=False)
In [ ]:
# reading the cleaned dataset from GDrive
data = pd.read_csv(root_path + 'data/clean_data.csv')

Defining and Fitting Spark Pipelines

In [ ]:
# creating the Spark dataframe for JobTitle and JobDescription
spark_df = spark.createDataFrame(data[['JobTitle', 'JobDescription']])
spark_df.show()
+--------------------+--------------------+
|            JobTitle|      JobDescription|
+--------------------+--------------------+
|Data Analyst, Cen...|Are you eager to ...|
|Quality Data Analyst|Overview  Provide...|
|Senior Data Analy...|We’re looking for...|
|        Data Analyst|Requisition Numbe...|
|Reporting Data An...|ABOUT FANDUEL GRO...|
|        Data Analyst|About Cubist Cubi...|
|Business/Data Ana...|Two Sigma is a di...|
|Data Science Analyst|Data Science Anal...|
|        Data Analyst|The Data Analyst ...|
|Data Analyst, Mer...|About Us  Riskifi...|
|        Data Analyst|NYU Grossman Scho...|
|        Data Analyst|BulbHead is curre...|
|        DATA ANALYST|Job Summary:  The...|
| Senior Data Analyst|About Known  Know...|
|Investment Adviso...|Investment Adviso...|
|Sustainability Da...|Job Description R...|
|        Data Analyst|Undertone stands ...|
|Clinical Data Ana...|About Us:  NYSTEC...|
|DATA PROGRAMMER/A...|Company Descripti...|
|        Data Analyst|About Us  At Teac...|
+--------------------+--------------------+
only showing top 20 rows

In [ ]:
# Spark NLP requires the input column to be converted into a document column
documentAssemblerJobTitle = DocumentAssembler()\
  .setInputCol("JobTitle")\
  .setOutputCol('document')\
  .setCleanupMode('disabled')

documentAssemblerJobDescription = DocumentAssembler()\
  .setInputCol("JobDescription")\
  .setOutputCol('document')\
  .setCleanupMode('disabled')

# Split the document into sentences
sentencerDL = SentenceDetectorDLModel.pretrained(name = "sentence_detector_dl", 
                                                 lang = "en")\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

# Split sentences into tokens
tokenizer = Tokenizer()\
  .setInputCols(['sentence'])\
  .setOutputCol('token')\
  .setCaseSensitiveExceptions(False)

# Clean unwanted characters
normalizer = Normalizer()\
  .setInputCols(["token"])\
  .setOutputCol("normalizedToken")\
  .setCleanupPatterns(["[^\w\d\s]"])\
  .setLowercase(True)

# Remove stopwords
stopwordsCleaner = StopWordsCleaner()\
  .setInputCols(["normalizedToken"])\
  .setOutputCol("cleanToken")\
  .setCaseSensitive(False)\
  .setLazyAnnotator(False)

# Apply spell checking
spellChecker = NorvigSweetingModel.pretrained()\
  .setInputCols(["cleanToken"])\
  .setOutputCol("checkedToken")\
  .setLazyAnnotator(False)

# Stems tokens
stemmer = Stemmer()\
  .setInputCols(["checkedToken"])\
  .setOutputCol("stemToken")\
  .setLanguage('English')\
  .setLazyAnnotator(False)

# Lemmatizes tokens
lemmatizer = LemmatizerModel.pretrained(name = "lemma_antbnc", lang = "en")\
  .setInputCols(['checkedToken'])\
  .setOutputCol('lemmaToken')\
  .setLazyAnnotator(False)

# Assembles tokens into documents
tokenAssemblerStem = TokenAssembler()\
    .setInputCols(["document", "stemToken"])\
    .setOutputCol("assembledStem")

# Assembles tokens into documents
tokenAssemblerLemma = TokenAssembler()\
    .setInputCols(["document", "lemmaToken"])\
    .setOutputCol("assembledLemma")

# Finisher brings the annotations back to the expected structure (arrays of tokens)
finisher = Finisher()\
  .setInputCols(["assembledLemma",
                 "lemmaToken"])\
  .setOutputCols(["assembledLemma",
                  "lemmaToken"])\
  .setCleanAnnotations(True)\
  .setIncludeMetadata(False)\
  .setOutputAsArray(True)

# Organizes the pipeline stages
processedPipelineJobTitle = Pipeline()\
  .setStages([documentAssemblerJobTitle,                  
              sentencerDL,
              tokenizer,
              normalizer,
              stopwordsCleaner,
              spellChecker,
              stemmer,
              lemmatizer,
              tokenAssemblerStem,
              tokenAssemblerLemma])

processedPipelineJobDescription = Pipeline()\
  .setStages([documentAssemblerJobDescription,                  
              sentencerDL,
              tokenizer,
              normalizer,
              stopwordsCleaner,
              spellChecker,
              stemmer,
              lemmatizer,
              tokenAssemblerStem,
              tokenAssemblerLemma])
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
spellcheck_norvig download started this may take some time.
Approximate size to download 4.2 MB
[OK!]
lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[OK!]
In [ ]:
# Fitting the pipelines
processedSparkJobTitle = processedPipelineJobTitle.fit(spark_df).transform(spark_df)
processedSparkJobDescription = processedPipelineJobDescription.fit(spark_df).transform(spark_df)
processedSparkJobDescription.printSchema()
root
 |-- JobTitle: string (nullable = true)
 |-- JobDescription: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- normalizedToken: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- cleanToken: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- checkedToken: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- stemToken: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- lemmaToken: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- assembledStem: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- assembledLemma: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)

In [ ]:
# extracting the lemmatized and stemmed tokens from the spark dataframe
# processed_JobTitle = processedSparkJobTitle.select(F.explode(F.arrays_zip('document.result',
#                                                                           'assembledLemma.result',
#                                                                           'assembledStem.result')).alias("cols")) \
#   .select(F.expr("cols['0']").alias("JobTitle"),
#           F.expr("cols['1']").alias("JobTitle_lemmas"),
#           F.expr("cols['2']").alias("JobTitle_stems")).toPandas()

# processed_JobTitle.head()
In [ ]:
# extracting the lemmatized and stemmed tokens from the spark dataframe
# processed_JobDescription = processedSparkJobDescription.select(F.explode(F.arrays_zip('document.result',
#                                                                                       'assembledLemma.result',
#                                                                                       'assembledStem.result')).alias("cols")) \
#   .select(F.expr("cols['0']").alias("JobDescription"),
#           F.expr("cols['1']").alias("JobDescription_lemmas"),
#           F.expr("cols['2']").alias("JobDescription_stems")).toPandas()

# processed_JobDescription.head()
In [ ]:
# merging the processed JobTitle and JobDescription data frames
# processed_text = pd.merge(processed_JobTitle, 
#                           processed_JobDescription,
#                           left_index = True, 
#                           right_index = True)
In [ ]:
# saving the processed text columns to GDrive
# processed_text.to_csv(root_path + '/data/processed_text.csv')
In [ ]:
# loading the processed text columns from GDrive
processed_text = pd.read_csv(root_path + '/data/processed_text.csv')
processed_text.head()
In [ ]:
# extracting only the lemmas
# JobTitle_lemmas = processedSparkJobTitle.select(F.explode(F.arrays_zip('lemmaToken.result')).alias("cols")) \
#   .select(F.expr("cols['0']").alias("JobTitle_lemmas")).toPandas()
# JobDescription_lemmas = processedSparkJobDescription.select(F.explode(F.arrays_zip('lemmaToken.result')).alias("cols")) \
#   .select(F.expr("cols['0']").alias("JobDescription_lemmas")).toPandas()

# # extracting only the stems
# JobTitle_stems = processedSparkJobTitle.select(F.explode(F.arrays_zip('stemToken.result')).alias("cols")) \
#   .select(F.expr("cols['0']").alias("JobTitle_stems")).toPandas()
# JobDescription_stems = processedSparkJobDescription.select(F.explode(F.arrays_zip('stemToken.result')).alias("cols")) \
#   .select(F.expr("cols['0']").alias("JobDescription_stems")).toPandas()
In [ ]:
# saving the lemmas
# JobTitle_lemmas.to_csv(root_path + '/data/JobTitle_lemmas.csv')
# JobDescription_lemmas.to_csv(root_path + '/data/JobDescription_lemmas.csv')

# saving the stems
# JobTitle_stems.to_csv(root_path + '/data/JobTitle_stems.csv')
# JobDescription_stems.to_csv(root_path + '/data/JobDescription_stems.csv')
In [ ]:
# reading the lemmas and stems
JobTitle_lemmas = pd.read_csv(root_path + '/data/JobTitle_lemmas.csv')
JobDescription_lemmas = pd.read_csv(root_path + '/data/JobDescription_lemmas.csv')
JobTitle_stems = pd.read_csv(root_path + '/data/JobTitle_stems.csv')
JobDescription_stems = pd.read_csv(root_path + '/data/JobDescription_stems.csv')

Defining Training and Testing Sets

In [ ]:
# split data into features and target variable
y = data['JobType']
X = data.drop('JobType', axis = 1)
In [ ]:
# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.2, 
                                                    random_state = 1)
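The 80/20 split above does not stratify by job type; since the three classes have different counts, passing `stratify=y` keeps their proportions equal across the train and test sets. A minimal sketch (toy stand-ins for `X` and `y`, not the notebook's actual data):

```python
# Stratified variant of the split above: stratify=y preserves the JobType
# class proportions in both partitions (toy data for illustration).
import pandas as pd
from sklearn.model_selection import train_test_split

y = pd.Series(['Data Analyst'] * 20 + ['Business Analyst'] * 40 + ['Data Scientist'] * 40)
X = pd.DataFrame({'SalaryAvg': range(100)})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# each class keeps its original share (0.2 / 0.4 / 0.4) in the test set
print(y_test.value_counts(normalize=True).round(2))
```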

Exploratory Data Analysis

Non-Job Description - Various Methods

In [ ]:
# looking at the salary distribution of each job type
salary_plot = sns.displot(x='SalaryAvg', data=data, hue='JobType', kde=True, aspect=1.5)

# saving the figure
salary_fig = salary_plot.fig
salary_fig.savefig(output_path + 'salary_hist_kf.png') 
In [ ]:
# plotting the distribution of rating 
rating_plot = sns.displot(x='Rating', data=data, hue='JobType', kde=True, aspect=1.5)

# saving the figure
rating_fig = rating_plot.fig
rating_fig.savefig(output_path + 'rating_hist_kf.png') 
In [ ]:
# plotting the distribution of company size
size_plot = sns.countplot(x='Size', hue='JobType', saturation=0.5, data=data)

# saving the figure
size_fig = size_plot.get_figure()
size_fig.savefig(output_path + 'size_bar_kf.png') 
In [ ]:
# taking a look at the correlations between certain features
data[['Rating', 'SalaryAvg', 'OrganizationAge', 'Size', 'JobType']].replace({'JobType': {'Data Analyst': 0, 'Business Analyst': 1, 'Data Scientist': 2}}).corr()
Out[ ]:
Rating SalaryAvg OrganizationAge Size JobType
Rating 1.000000 0.059236 -0.114179 -0.192062 0.028877
SalaryAvg 0.059236 1.000000 -0.034585 0.014194 0.426619
OrganizationAge -0.114179 -0.034585 1.000000 0.318945 0.027096
Size -0.192062 0.014194 0.318945 1.000000 0.071762
JobType 0.028877 0.426619 0.027096 0.071762 1.000000
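Mapping JobType to 0/1/2 imposes an arbitrary ordering, so the Pearson correlations in the JobType row and column depend on that coding. A coding-free alternative is to correlate the numeric features against one-hot JobType indicators; a sketch on toy data:

```python
# Sketch: correlating a numeric feature against one-hot JobType indicators
# avoids the arbitrary 0/1/2 ordering used above (toy data for illustration).
import pandas as pd

df = pd.DataFrame({
    'SalaryAvg': [60, 65, 70, 95, 100, 110],
    'JobType': ['Data Analyst', 'Data Analyst', 'Business Analyst',
                'Data Scientist', 'Data Scientist', 'Business Analyst'],
})

dummies = pd.get_dummies(df['JobType'], dtype=float)  # one indicator column per class
corr = pd.concat([df[['SalaryAvg']], dummies], axis=1).corr()
print(corr['SalaryAvg'].round(2))
```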

Job Description - Word Clouds

Data Scientist

In [ ]:
# creating data scientist mask
ds_mask = data['JobType'] == 'Data Scientist'

# Join the different lemmas together
ds_lemmas = " ".join(list(processed_text[ds_mask]['JobDescription_lemmas'].values))

# Create a WordCloud object
wordcloud = WordCloud(background_color = "white", 
                      max_words = 5000, 
                      contour_width = 3, 
                      contour_color = 'steelblue')

# Generate a word cloud
wordcloud.generate(ds_lemmas)

# Save the word cloud
wordcloud.to_file(output_path + "/ds_wordcloud.png")

# Visualize the word cloud
wordcloud.to_image()
Out[ ]:

Data Analyst

In [ ]:
# remove the rows that have NA in JobDescription_lemmas
processed_text['JobType'] = data['JobType']
processed_text.drop(processed_text[processed_text['JobDescription_lemmas'].isna()].index, inplace = True)
In [ ]:
# creating data analyst mask
da_mask = data['JobType'] == 'Data Analyst'

# Join the different lemmas together
da_lemmas = " ".join(list(processed_text[da_mask]['JobDescription_lemmas'].values))

# Create a WordCloud object
wordcloud = WordCloud(background_color = "white", 
                      max_words = 5000, 
                      contour_width = 3, 
                      contour_color = 'steelblue')

# Generate a word cloud
wordcloud.generate(da_lemmas)

# Save the word cloud
wordcloud.to_file(output_path + "/da_wordcloud.png")

# Visualize the word cloud
wordcloud.to_image()
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:5: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  """
Out[ ]:

Business Analyst

In [ ]:
# creating business analyst mask
ba_mask = data['JobType'] == 'Business Analyst'

# Join the different lemmas together
ba_lemmas = " ".join(list(processed_text[ba_mask]['JobDescription_lemmas'].values))

# Create a WordCloud object
wordcloud = WordCloud(background_color = "white", 
                      max_words = 5000, 
                      contour_width = 3, 
                      contour_color = 'steelblue')

# Generate a word cloud
wordcloud.generate(ba_lemmas)

# Save the word cloud
wordcloud.to_file(output_path + "/ba_wordcloud.png")

# Visualize the word cloud
wordcloud.to_image()
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:5: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  """
Out[ ]:
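The UserWarning above appears because the boolean masks are built from `data`, whose index no longer matches `processed_text` once the NA rows are dropped. Building the mask from `processed_text` itself (which already carries the `JobType` column) avoids the reindexing; a sketch on toy frames that shadow the notebook's names:

```python
# Sketch of the mask-alignment fix for the UserWarning above: build the
# boolean mask from the same (row-dropped) DataFrame it indexes (toy frames).
import pandas as pd

data = pd.DataFrame({'JobType': ['Data Analyst', 'Business Analyst', 'Data Analyst']})
processed_text = pd.DataFrame({'JobDescription_lemmas': ['a b', None, 'c d']})
processed_text['JobType'] = data['JobType']
processed_text.drop(processed_text[processed_text['JobDescription_lemmas'].isna()].index,
                    inplace=True)

# mask and frame now share the same index, so no reindexing warning
da_mask = processed_text['JobType'] == 'Data Analyst'
print(processed_text[da_mask]['JobDescription_lemmas'].tolist())
```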

Job Description - Clustering

In [ ]:
# NMF: helper function to fit NMF and print the top words per topic
def nmf_function(num_components, doc_text_matrix, vectorizer):
    nmf = NMF(num_components)
    doc_topic = nmf.fit_transform(doc_text_matrix)

    # topic-word matrix: one row per topic, one column per term
    topic_word = pd.DataFrame(nmf.components_.round(3),
                              index=range(num_components),
                              columns=vectorizer.get_feature_names())

    print(display_topics(nmf, vectorizer.get_feature_names(), 15))
In [ ]:
# running TF-IDF on the job descriptions
stop_words = text.ENGLISH_STOP_WORDS
cv_tfidf = TfidfVectorizer(stop_words=stop_words, min_df=0.1, max_df=0.7)
x_tfidf = cv_tfidf.fit_transform(data.JobDescription).toarray()
df_tfidf = pd.DataFrame(x_tfidf, columns=cv_tfidf.get_feature_names())
In [ ]:
job_titles = data['JobType'].values
In [ ]:
# run NMF on the TF-IDF output and return the document-topic (H) matrix

def nmf_HMatrix(num_components, doc_text_matrix, vectorizer):
    nmf = NMF(num_components)
    doc_topic = nmf.fit_transform(doc_text_matrix)

    H = pd.DataFrame(doc_topic.round(3),
                     index=job_titles,
                     columns=range(num_components))
    return H
In [ ]:
h9 = nmf_HMatrix(9,df_tfidf,cv_tfidf)
h9
Out[ ]:
0 1 2 3 4 5 6 7 8
Data Analyst 0.021 0.005 0.060 0.032 0.045 0.017 0.013 0.007 0.039
Data Analyst 0.015 0.000 0.000 0.085 0.070 0.000 0.000 0.000 0.056
Data Analyst 0.113 0.023 0.045 0.008 0.000 0.006 0.000 0.000 0.000
Data Analyst 0.027 0.007 0.075 0.017 0.025 0.066 0.014 0.016 0.010
Data Analyst 0.038 0.001 0.009 0.037 0.026 0.049 0.000 0.000 0.009
... ... ... ... ... ... ... ... ... ...
Data Scientist 0.000 0.006 0.000 0.017 0.000 0.074 0.046 0.000 0.011
Data Scientist 0.000 0.000 0.000 0.274 0.000 0.000 0.000 0.000 0.000
Data Scientist 0.009 0.042 0.062 0.038 0.045 0.016 0.014 0.026 0.000
Data Scientist 0.012 0.047 0.006 0.040 0.053 0.010 0.006 0.029 0.000
Data Scientist 0.011 0.007 0.008 0.000 0.038 0.053 0.000 0.000 0.018

10254 rows × 9 columns
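Each row of H holds one document's loadings on the 9 topics, so a hard topic assignment is just the per-row argmax. A sketch on a toy 3-topic H matrix (the real `h9` has 9 topic columns):

```python
# Sketch: turning an NMF document-topic matrix (like h9) into hard topic
# assignments via the per-row argmax (toy 3-topic example).
import pandas as pd

H = pd.DataFrame(
    [[0.02, 0.10, 0.01],
     [0.00, 0.01, 0.27],
     [0.11, 0.02, 0.04]],
    index=['Data Analyst', 'Data Scientist', 'Business Analyst'],
    columns=[0, 1, 2])

dominant_topic = H.idxmax(axis=1)   # topic with the largest loading per document
print(dominant_topic)
```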

In [ ]:
X = x_tfidf
model = NMF(9)
model.fit(X)
nmf_features = model.transform(X)

components_df = pd.DataFrame(model.components_, columns=cv_tfidf.get_feature_names())
In [ ]:
# get the top words per topic from the NMF output
for topic in range(components_df.shape[0]):
    tmp = components_df.iloc[topic]
    print(f'For topic {topic+1} the words with the highest value are:')
    print(tmp.nlargest(10))
    print('\n')
For topic 1 the words with the highest value are:
analytics     1.779665
insights      1.336162
marketing     1.220417
product       1.210290
team          0.830513
dashboards    0.659290
tableau       0.658031
drive         0.654642
key           0.649382
analysis      0.639048
Name: 0, dtype: float64


For topic 2 the words with the highest value are:
learning       1.867747
machine        1.628548
science        0.942588
models         0.817931
python         0.588413
scientist      0.523251
research       0.431293
engineering    0.430432
deep           0.414148
modeling       0.408016
Name: 1, dtype: float64


For topic 3 the words with the highest value are:
status        1.417434
employment    1.130901
gender        0.915227
protected     0.873540
disability    0.784341
applicants    0.758678
equal         0.747177
veteran       0.708243
national      0.706999
race          0.704608
Name: 2, dtype: float64


For topic 4 the words with the highest value are:
databases      1.069289
statistical    1.060432
analyze        0.821333
systems        0.774654
information    0.697255
reports        0.653154
techniques     0.632262
using          0.602108
computer       0.536549
database       0.529613
Name: 3, dtype: float64


For topic 5 the words with the highest value are:
ability       1.031929
management    0.984993
financial     0.949251
reporting     0.677091
support       0.654219
analysis      0.605176
reports       0.572117
required      0.554115
perform       0.528327
duties        0.525796
Name: 4, dtype: float64


For topic 6 the words with the highest value are:
clients       1.087149
services      1.009696
technology    0.950620
solutions     0.877288
cloud         0.794818
world         0.768271
team          0.723355
company       0.680723
client        0.674764
global        0.650484
Name: 5, dtype: float64


For topic 7 the words with the highest value are:
job            1.394811
analyst        1.179192
location       0.894534
description    0.880508
sql            0.835923
contract       0.735704
required       0.699952
com            0.615212
position       0.539015
title          0.515916
Name: 6, dtype: float64


For topic 8 the words with the highest value are:
requirements    1.725659
project         1.164828
technical       1.072192
systems         1.011238
user            0.956233
functional      0.891913
development     0.859471
test            0.824492
testing         0.745069
process         0.683603
Name: 7, dtype: float64


For topic 9 the words with the highest value are:
health        1.705555
research      1.537406
care          1.108308
healthcare    0.931911
medical       0.779510
required      0.485422
benefits      0.422514
scientist     0.379576
education     0.379036
insurance     0.373150
Name: 8, dtype: float64


K-Means

In [ ]:
# running k-means clustering on the NMF output
kmeans9 = KMeans(n_clusters=3, random_state=555)
clustering_ori9 = kmeans9.fit_predict(h9)
kmeans9.cluster_centers_
Out[ ]:
array([[0.01962856, 0.01017603, 0.06681052, 0.01309059, 0.02786986,
        0.02646282, 0.01541258, 0.0208129 , 0.02192761],
       [0.01863507, 0.00747332, 0.004708  , 0.02056734, 0.03033206,
        0.0189335 , 0.02753325, 0.02881578, 0.01684467],
       [0.0204135 , 0.10233755, 0.01574177, 0.01393333, 0.00688354,
        0.02408186, 0.01092911, 0.00828861, 0.01434177]])
In [ ]:
# t-SNE on the k-means clustering of the NMF output
labels = kmeans9.predict(h9)

label = ["cluster0", "cluster1", "cluster2"]

model = TSNE(learning_rate=100)
Tsne_transformed = model.fit_transform(h9)

xs = Tsne_transformed[:, 0]
ys = Tsne_transformed[:, 1]
scatter = plt.scatter(xs, ys, c=labels, alpha=.6)

handles, _ = scatter.legend_elements(prop='colors')
plt.legend(handles, label)
Out[ ]:
<matplotlib.legend.Legend at 0x7f95b8b4b750>
In [ ]:
# comparing the k-means cluster assignments (from NMF on TF-IDF) to the actual job types

cluster_comparison = pd.DataFrame(kmeans9.predict(h9), job_titles)
cluster_comparison["cluster"]=cluster_comparison[0]
cluster_comparison = cluster_comparison.drop(0, axis=1)
cluster_comparison.reset_index(inplace=True)
cluster_comparison.groupby(['index', 'cluster']).size()
Out[ ]:
index             cluster
Business Analyst  0          1874
                  1            18
                  2          2200
Data Analyst      0           316
                  1            80
                  2          1857
Data Scientist    0           326
                  1          1162
                  2          2421
dtype: int64
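The raw counts above are easier to compare as row-normalized proportions: `pd.crosstab` with `normalize='index'` gives the share of each job type landing in each cluster. A sketch on toy labels (not the actual cluster output):

```python
# Sketch: row-normalized crosstab of true job type vs. k-means cluster,
# making the cluster composition above easier to compare (toy labels).
import pandas as pd

job = ['Business Analyst'] * 4 + ['Data Analyst'] * 4 + ['Data Scientist'] * 4
cluster = [0, 0, 2, 2, 2, 2, 2, 0, 1, 1, 2, 2]

ct = pd.crosstab(pd.Series(job, name='JobType'),
                 pd.Series(cluster, name='cluster'),
                 normalize='index')   # each row sums to 1
print(ct.round(2))
```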

Hierarchical Clustering

In [ ]:
# hierarchical clustering on the TF-IDF output only
labels = list(job_titles)

x = df_tfidf.values
normalized_x = normalize(x)
plt.figure(figsize=(15, 12))

mergings = linkage(normalized_x, method='ward')

dendrogram(mergings,
           labels=labels,
           leaf_rotation=90,
           leaf_font_size=8)

plt.show()
In [ ]:
# hierarchical clustering on the output of NMF on TF-IDF
labels = list(job_titles)

x = h9.values
normalized_x = normalize(x)
plt.figure(figsize=(15, 12))

mergings = linkage(normalized_x, method='ward')

dendrogram(mergings,
           labels=labels,
           leaf_rotation=90,
           leaf_font_size=8)

plt.show()

Job Description - t-SNE Visualization

In [ ]:
# downloading stop words from nltk package
stop_words = stopwords.words('english')

# to account for contractions typed without apostrophes (e.g. "dont"),
# we also add apostrophe-stripped versions of the stop words
stop_words_mispelled = [word.replace("'", '') for word in stop_words]
stop_words = list(set(stop_words + stop_words_mispelled))
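A quick check of the apostrophe-stripping trick on a toy stop-word list, showing that contractions typed without apostrophes are filtered as well:

```python
# Toy demonstration of the stop-word augmentation above: both "don't"
# and the apostrophe-less "dont" end up in the stop-word list.
stop_words = ["don't", "won't", "the", "a"]
stop_words_misspelled = [w.replace("'", '') for w in stop_words]
stop_words = list(set(stop_words + stop_words_misspelled))

tokens = ['dont', 'apply', 'unless', "don't", 'qualified']
kept = [t for t in tokens if t not in stop_words]
print(kept)   # both spellings of "don't" are removed
```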
In [ ]:
# creating corpus using job descriptions with some minor text cleanup
corpus = data['JobDescription'].str.lower().str.replace('[-,;\.:]', '', regex=True)

# creating tfidf vectorizer object to create a feature space
# the custom token pattern keeps only purely alphabetic tokens
tfidf_obj = TfidfVectorizer(token_pattern=r'(?u)\b[a-z]+\b', stop_words=stop_words)

# creating truncated svd object to reduce the feature space
# using 100 dimensions, the value recommended for LSA in the scikit-learn documentation
svd_obj = TruncatedSVD(n_components=100)

# creating vectorization of corpus
tfidf_corpus = tfidf_obj.fit_transform(corpus)

# reducing dimensions
tfidf_corpus_red = svd_obj.fit_transform(tfidf_corpus)
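It is worth checking how much of the TF-IDF variance the SVD components retain, via `explained_variance_ratio_`. A sketch on a toy corpus (5 components instead of the notebook's 100):

```python
# Sketch: checking how much TF-IDF variance the truncated SVD keeps,
# using explained_variance_ratio_ (toy corpus for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = ['data analyst sql reporting', 'data scientist python machine learning',
          'business analyst requirements process', 'machine learning models python',
          'sql dashboards reporting tableau', 'requirements gathering stakeholder process']

tfidf = TfidfVectorizer().fit_transform(corpus)
svd = TruncatedSVD(n_components=5).fit(tfidf)
print(f'variance retained: {svd.explained_variance_ratio_.sum():.2f}')
```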
In [ ]:
# creating a list of the three job types
# will be used to add points to the plots one type at a time
job_types = data['JobType'].unique()

# the figure will have n_rows * n_cols number of subplots
n_rows = 3
n_cols = 2

# instantiating a figure object
fig, axs = plt.subplots(n_rows, n_cols, figsize=(20, 20))

# initializing perplexity at 0; it is incremented by 10 for each subplot
# a learning rate of 200 was chosen after trying several other values
perplexity = 0
learn_rate = 200

# plotting the data for one subplot at a time
for row in range(n_rows):
    for col in range(n_cols):

        # running tsne on the data with the current loop perplexity
        perplexity += 10
        tsne = TSNE(n_components=2, perplexity=perplexity, learning_rate=learn_rate)
        tsne_reduced = tsne.fit_transform(tfidf_corpus_red)

        # adding points for one job type at a time
        for j_type in job_types:
            job_mask = data['JobType'] == j_type
            axs[row, col].scatter(tsne_reduced[job_mask, 0],
                                  tsne_reduced[job_mask, 1],
                                  label=j_type,
                                  alpha=0.30,
                                  edgecolors='#000000',
                                  s=40)
            
        # adding axis labels
        axs[row, col].set_title(f'Perplexity = {perplexity}', size=16, weight='bold')
        axs[row, col].set_xlabel('Component 1', size=14, weight='bold')
        axs[row, col].set_ylabel('Component 2', size=14, weight='bold')

        # adding legend
        axs[row, col].legend(loc='lower right')
    
plt.tight_layout()
In [ ]:
fig.savefig(output_path + f'tsne_learning_rate_{learn_rate}.png', facecolor='white', transparent=False)

Modeling

Unsupervised

Topic Modeling

Latent Dirichlet Allocation (LDA)

Job Descriptions
In [ ]:
tfidf = TfidfVectorizer(max_df=0.9,min_df=2,stop_words='english')
tfidf_fit = tfidf.fit_transform(data['JobDescription'])
In [ ]:
# generating 3 topics
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda_fit = lda.fit(tfidf_fit)

# extracting the keywords in each topic
for id_value, value in enumerate(lda_fit.components_):
    print(f"The topic would be {id_value}")
    print([tfidf.get_feature_names()[index] for index in value.argsort()[-10:]])
    print("\n")
The topic would be 0
['epiq', 'macquarie', 'cgi', 'qiskit', 'humana', 'guidehouse', 'gs', 'band', 'capgemini', 'quantum']


The topic would be 1
['systems', 'analysis', 'analyst', 'ability', 'skills', 'work', 'management', 'requirements', 'business', 'data']


The topic would be 2
['insights', 'work', 'python', 'science', 'team', 'business', 'machine', 'analytics', 'learning', 'data']


Data Scientist
In [ ]:
ds_lda = data[data['JobType']=='Data Scientist']['JobDescription']
In [ ]:
tfidf = TfidfVectorizer(max_df=0.9,min_df=2,stop_words='english')
ds_tfidf_fit = tfidf.fit_transform(ds_lda)
In [ ]:
# generating 3 topics
lda = LatentDirichletAllocation(n_components=3, random_state=42)
ds_lda_fit = lda.fit(ds_tfidf_fit)

# extracting the keywords in each topic
for id_value, value in enumerate(ds_lda_fit.components_):
    print(f"The topic would be {id_value}")
    print([tfidf.get_feature_names()[index] for index in value.argsort()[-10:]])
    print("\n")
The topic would be 0
['lab', 'drug', 'clinical', 'scientific', 'assays', 'chemistry', 'molecular', 'biology', 'laboratory', 'cell']


The topic would be 1
['years', 'development', 'analysis', 'science', 'analytics', 'learning', 'skills', 'team', 'work', 'business']


The topic would be 2
['tutor', 'tetra', 'band', 'middot', 'varsity', 'accenture', 'tutoring', 'gs', 'gsk', 'tutors']


Data Analyst
In [ ]:
da_lda = data[data['JobType']=='Data Analyst']['JobDescription']
In [ ]:
tfidf = TfidfVectorizer(max_df=0.9,min_df=2,stop_words='english')
da_tfidf_fit = tfidf.fit_transform(da_lda)
In [ ]:
# generating 3 topics
lda = LatentDirichletAllocation(n_components=3, random_state=42)
da_lda_fit = lda.fit(da_tfidf_fit)

# extracting the keywords in each topic
for id_value, value in enumerate(da_lda_fit.components_):
    print(f"The topic would be {id_value}")
    print([tfidf.get_feature_names()[index] for index in value.argsort()[-10:]])
    print("\n")
The topic would be 0
['hcc', 'guidewire', 'lowes', 'caregivers', 'clinicians', 'middot', 'half', 'mount', 'sinai', 'robert']


The topic would be 1
['registrar', 'analystengineer', 'etlelt', 'greetings', 'âwww', 'tumor', 'temple', 'conch', 'conchtech', 'bull']


The topic would be 2
['analytics', 'knowledge', 'analyst', 'team', 'ability', 'management', 'analysis', 'skills', 'work', 'business']


Business Analyst
In [ ]:
ba_lda = data[data['JobType']=='Business Analyst']['JobDescription']
In [ ]:
tfidf = TfidfVectorizer(max_df=0.9,min_df=2,stop_words='english')
ba_tfidf_fit = tfidf.fit_transform(ba_lda)
In [ ]:
# generating 3 topics
lda = LatentDirichletAllocation(n_components=3, random_state=42)
ba_lda_fit = lda.fit(ba_tfidf_fit)

# extracting the keywords in each topic
for id_value, value in enumerate(ba_lda_fit.components_):
    print(f"The topic would be {id_value}")
    print([tfidf.get_feature_names()[index] for index in value.argsort()[-10:]])
    print("\n")
The topic would be 0
['analysis', 'team', 'systems', 'project', 'ability', 'management', 'skills', 'work', 'requirements', 'data']


The topic would be 1
['font', 'arthur', 'lawrence', 'labeling', 'aston', 'carter', 'mso', 'harris', 'band', 'gs']


The topic would be 2
['fargo', '22nd', 'wells', 'virginia', 'cabinet', 'civilian', 'threat', 'leidos', 'usaa', 'accenture']


Top2Vec

Job Descriptions
In [ ]:
# parsing raw JobDescription text
docs_raw = list(data['JobDescription'].values)
docs_raw[:5]
Out[ ]:
["Are you eager to roll up your sleeves and harness data to drive policy change? Do you enjoy sifting through complex datasets to illuminate trends and insights? Do you see yourself working for a values-driven organization with a vision to tackle the most pressing injustices of our day?  We are looking to hire a bright, hard-working, and creative individual with strong data management skills and a demonstrated commitment to immigrant's rights. The Data Analyst will assist with analysis and reporting needs for Veras Center on Immigration and Justice (CIJ), working across its current projects and future Vera initiatives.  Who we are:  Founded in 1961, The Vera Institute is an independent, non-partisan, nonprofit organization that combines expertise in research, technical assistance, and demonstration projects to assist leaders in government and civil society examine justice policy and practice, and improve the systems people rely on for justice and safety. We study problems that impede human dignity and justice. We pilot solutions that are at once transformative and achievable. We engage diverse communities in informed debate. And we harness the power of evidence to drive effective policy and practice What were doing:  We are helping to build a movementamong government leaders, advocates, and the immigration legal services communitytowards universal legal representation for immigrants facing deportation. In the face of stepped-up immigration enforcement, millions of non-citizens are at risk of extended detention and permanent separation from their families and communities. Veras Center on Immigration and Justice (CIJ) partners with government, non-profit partners, and communities to improve government systems that affect immigrants and their families. CIJ administers several nationwide legal services programs for immigrants facing deportation, develops and implements pilot programs, provides technical assistance, and conducts independent research and evaluation.  
Thats where you come in: The Data Analyst will support the Centers programmatic efforts through regular monitoring and reporting of federal government and subcontractor data. CIJ manages several proprietary databases that run on AWS and Caspio and uses SQL, R, and Python to manage data. This is an opportunity to help shape an innovative national research and policy agenda as part of a dedicated team of experts working to improve access to justice for non-citizens.  Vera seeks to hire a Data Analyst to work on various data management projects with its Center on Immigration and Justice (CIJ). In collaboration with other Data Analysts, this position will involve work across several projects, such as the Unaccompanied Childrens Program (UCP), a program to increase legal representation for immigrant children facing deportation without a parent or legal guardian. The position may cover additional duties for the Legal Orientation Program for Custodians (LOPC), which educates the custodians of unaccompanied children about their rights and the immigration court process.  About the role:  As a Data Analyst, you will report to a member of the research team and work in close collaboration with other Vera staff on ongoing database management, monitoring, reporting, and analysis projects. Youll support the team by taking ownership of ongoing monitoring and reporting tasks involving large data sets. 
Other principal responsibilities will include: Supporting research staff by preparing large datasets for analysis, including merging, cleaning, and recoding data; Providing insights into program performance through summary statistics and performance indicators; Producing timely reports on Vera projects for team members and stakeholders; Improving recurring reporting processes by optimizing code and producing subsequent documentation; Coordinating database management tasks such as participating in new database design, modifying existing databases, and communicating with outside engineers and subcontractors; Developing codebooks and delivering user trainings through webinars and database guides; Building and maintaining interactive dashboards; Documenting and correcting data quality issues; Working with supervisors to prioritize program needs; Assisting on other projects and tasks as assigned. About you:  Youre committed to improving issues affecting immigrants in the United States. Applicants with personal experiences with the immigration system are especially encouraged to apply.  Youre just getting started in your career and have 1 2 years of professional or internship experience working with large datasets and preparing data for analysis.  You have a real enthusiasm for working with data.  You are comfortable writing queries in SQL, R, and/or Python, or have a solid foundation coding in other programming languages used to manipulate data. Experience working collaboratively using tools like Git/GitHub is a plus.  You have exceptional attention to detail, strong problem-solving ability and logical reasoning skills, and the ability to detect anomalies in data.  Youre able to work on multiple projects effectively and efficiently, both independently and collaboratively with a team.  This position involves working with secure data that may require government security clearance. That clearance is restricted to U.S. 
citizens and citizens of countries that are party to collective defense agreements with the U.S. The list of those countries is detailed on this webpage. An additional requirement of that clearance is residence in the United States for at least three of the last five years.  How to apply:  Please submit cover letter and resume. Applications will be considered on a rolling basis until position is filled. Online submission in PDF format is preferred. Applications with no cover letter attached will not be considered. The cover letter should address your interest in CIJ and this position.  However, if necessary, materials may be mailed or faxed to  ATTN: Human Resources / CIJ Data Analyst Recruitment  Vera Institute of Justice  34 35th St, Suite 4-2A  Brooklyn, NY 11232  Fax: (212) 941-9407  Please use only one method (online, mail or fax) of submission.  No phone calls, please. Only applicants selected for interviews will be contacted.  Vera is an equal opportunity/affirmative action employer. All qualified applicants will be considered for employment without unlawful discrimination based on race, color, creed, national origin, sex, age, disability, marital status, sexual orientation, military status, prior record of arrest or conviction, citizenship status, current employment status, or caregiver status.  Vera works to advance justice, particularly racial justice, in an increasingly multicultural country and globally connected world. We value diverse experiences, including with regard to educational background and justice system contact, and depend on a diverse staff to carry out our mission.  For more information about Vera and CIJs work, please visit www.vera.org.  Powered by JazzHR",
 'Overview  Provides analytical and technical support for the integration of multiple data sources used to prepare internal and external reporting for the Quality Management team and business stakeholders. Provides support and analytical insight for Quality Incentive measures, HEDIS measures, and Quality Improvement initiatives. Monitors, analyzes, and communicates Quality performance related to benchmarks. Collaborates with clinical and operational teams within Quality Management, as well as with CHOICE Clinical Operations and Business Intelligence & Analytics (BIA). Participates in data validation of current reporting and dashboards. Monitors data integrity of databases and provides recommendation accordingly. Participates in the development of internal dashboards and databases. Works under general direction.  Responsibilities Provides support and analytical insight for Quality Incentive measures, HEDIS measures, and Quality Improvement initiatives. Monitors internal performance against benchmarks through analysis. Participates in the identification, development, management, and monitoring of quality improvement initiatives. Collaborates with Education staff and makes recommendations for areas of focus in training of assessors and care managers, based on analysis of performance trends. Researches and identifies technical/operational problems surrounding systems/applications; communicates/refers complex and unresolved problems to management, Business Intelligence & Analytics (BIA), and/or IT. Conducts ad hoc analyses to help identify operational gaps in care; drafts presentations, reports, publications, etc. regarding results of analyses. Communicates results of data analysis to non-technical audiences. Participates in prioritization of departmental goals based on identification of operational gaps in care. Participates in establishing data quality specifications and designs. 
Coordinates and supports integrated data systems for analyzing and validating information. Identifies and makes recommendations for reporting re-designs and platforms for reporting (e.g. automating a manual Excel file using macros, developing a MicroStrategy dashboard to replace manually updated Excel dashboards, moving data storage from Excel to Access, etc.), as needed. Trains staff on use of new/updated systems and related topics. Assists Quality management team with database and department reports. Conducts operations review and analysis of processes and procedures, issues report of findings and implements approved changes as required. Identifies and recommends software needs and applications to accomplish required reporting. Retrieves, compiles, reviews and ensures accuracy of data from databases; researches and corrects discrepancies, as needed. Analyzes data from internal and external sources. Identifies and resolves data quality issues before reports are generated. Works with staff to correct data entry errors. Analyzes data, identifies trends, reoccurring problems, statistically significant findings and prepares reports/summaries for management review. Acts as a liaison between Quality Management, CHOICE Clinical Operations, and BIA. Reviews and identifies trends and variances in data and reports. Researches findings and determines appropriateness of elevating identified issues to leadership for further review/evaluation/action. Monitors and maintains files by ensuring that files are current and of relevant nature. Analyzes and corrects error reports to ensure timely and accurate data; develops corrective actions to prevent errors where possible. Participates in special projects and performs other duties, as needed. Qualifications Education: Bachelors degree in bio/statistics, epidemiology, mathematics, computer science, social sciences, a related field or the equivalent work experience required. 
Masters degree with concentration in computer science, data science, or statistics preferred.   Experience: Minimum of two years experience performing increasingly complex data analysis and interpretation, preferably in a managed care or health care setting, required. Experience with data extraction and manipulation required. Experience with relational databases and programming experience in SQL or PL/SQL required. Experience with claims data and health plan quality metrics (e.g., HEDIS, QARR) preferred. Proficiency conducting statistical analysis with R, SAS, Stata or other statistical software preferred. Advanced personal computer skills, including Microsoft Word, PowerPoint, Excel, and Access required. Effective oral, written communication and interpersonal skills required. Ability to multi task in a fast-paced environment required.  CA2020',
 'We’re looking for a Senior Data Analyst who has a love of mentorship, data visualization, and generating actionable insights from raw data. In this role, you’ll have the opportunity to be an organizational influencer, who will generate insights with a good degree of autonomy, and partner with data science to grow deeper analytical skills. You will be joining the Insights & Analytics team, a team tasked with developing insights and reporting to support our customers and advisors’ needs. This team sits within the Customer Operations team, but is also connected to the Product organization.  In this role, you will work mainly with Customer Operations stakeholders to set KPIs and evaluate the effectiveness of current strategies and workflows. You will be involved in many aspects of data operations, from data auditing to building dashboards and analytical insights. For example, you will review the code of more junior analysts and organize coding workshops. You will build metrics to evaluate the performance of our advisors, eliminating confounding variables and creating weighted measures that account for individual success. You will define metrics and create dashboards to track the success of the current strategic direction. You will analyze the text-based interactions with our advisors to improve care quality. You will also analyze interactions with our chatbot to improve the responses of the bot to match customer expectations. You will collaborate with the Data Engineering team in pursuit of better data modeling and you can collaborate with the Data Science team on more modeling-heavy pursuits, like topic modeling or search recommendations. You will be a mentor to all other members of the analytics team, which demands that you excel in a culture of teaching and learning.  You will report to the Manager of the Insights & Analytics team, Customer Operations in Squarespace’s New York offices.  
RESPONSIBILITIES Design and develop KPIs, reports, and dashboards using BI/data visualization tools that clearly track the effectiveness of current strategies Execute and communicate impactful analyses, using SQL and R/Python to improve the customer experience and drive Customer Operations and Product roadmap Raise the skill level of the analytics team through reviewing code, organizing training sessions, acting as a mentor, and introducing new analytical methods and tools Help design and evaluate experiments and pilots, setting clear hypotheses and success metrics, as well as using the right statistical methods for analyses Effectively partner with Customer Operations and Product stakeholders, as well as our Data Engineering team QUALIFICATIONS 4 or more years of experience as a data analyst, data scientist, or in a relevant field in which you have worked with large datasets to transform data to meaningful insights Mastery of SQL, and experience with R/Python Experience with dashboarding/BI tools like Chartio, Looker, Tableau Experience with data cleaning, analysis, and presentation to non-technical stakeholders Bachelor’s degree in a quantitative or logic-driven discipline Intellectual curiosity and a deep love of mentorship and generating actionable insights Preferred: Advanced degree (Master’s or PhD) in quantitative field Preferred: Experience with topic modeling, text analytics/natural language processing We are hiring at various experience levels and we’re particularly interested in having a diverse team with a broad set of skills and viewpoints. If this seems like an opportunity you’d like to explore, but you’re not sure if you qualify, we encourage you to apply anyway!  About Squarespace  Squarespace makes beautiful products to help people with creative ideas succeed. 
By blending elegant design and sophisticated engineering, we empower millions of people — from individuals and local artists to entrepreneurs shaping the world’s most iconic businesses — to share their stories with the world. Squarespace’s team of more than 900 is headquartered in downtown New York City, with offices in Dublin and Portland. For more information, visit www.squarespace.com/about.  Today, more than a million people around the globe use Squarespace to share different perspectives and experiences with the world. Not only do we embrace and celebrate the diversity of our customer base, but we also strive for the same in our employees. At Squarespace, we are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender, gender identity or expression, or veteran status. We are proud to be an equal opportunity workplace.',
 'Requisition NumberRR-0001939 Remote:Yes We collaborate. We create. We innovate.  Intrigued?  You’re a business professional with an innate curiosity that thrives in a dynamic and Agile environment. You appreciate teamwork, exemplify integrity, perseverance, flexibility, and a generosity of spirit… if this sounds like you, then please apply – we’d love to meet you!  Celerity is expanding and on the hunt for the savvy, creative, and analytically sound individuals that are motivated by solving complex problems. We’re in the business of transforming how people, process, and systems co-exist, while improving operational efficiencies and user-driven interactions. We work with groundbreaking companies, melding expertise in Digital Strategy, Technology, Creative, and Business Transformation. The health and safety of our employees is our top priority. Due to the pandemic, all our employees are working remotely and we will be conducting candidate interviews by video. This position will continue to be remote once the COVID-19 crisis has abated.  
What You’ll Do  • Perform data visualization and comparisons across multiple data sources / systems of record to provide value-added insights to support critical business operations teams • Build scripts to perform low-level analysis and data investigation working in conjunction with other Celerity or client team members to improve data accuracy and make more informed decisions • Support movement of large amounts of data and existing processes into new ecosystems including looking for opportunities for greater automation of existing processes via SQL • Produce customized reports and visualizations as well as other ad-hoc analysis in support of time-sensitive business requests • Demonstrate flexibility to expand beyond a pure data analysis role including upstream requirements gathering or downstream process recommendations/implementations centered around a strong understanding of client data  About You  • Bachelor’s Degree in Business, Management, or a quantitative field 3+ years SQL development of complex queries • 2+ years professional experience • Expertise designing and implementing ETL processes You have Teradata SQL Experience (preferred) • Experience with Data Visualization such as Tableau or PowerBI • You understand and have experience with multidimensional analysis and queries • Strong communication skills • You have an analytical mindset to be able to ask probing questions to eradicate ambiguity and are willing to point out flaws in logic and identify better ways to do things • You have “Good Data Intuition” • You are not an order taker, but an analyst that can sniff out trends, identify patterns, and answer open ended questions • Willingness to expand scope of work beyond data to include business analyst, process analyst, or project manager type activities as necessary We Are Celerity  Millions of people every day use websites, applications, and business processes designed and built by Celerity. 
As a consultancy of technologists, creatives, and business experts, we provide the action plans, teams, and solutions our clients need to transform the way they do business and deepen engagement with their customers.  Celerity empowers autonomy and accountability. We understand life happens and want our team to do what is needed to get the work done, enjoy it along the way, and have plenty of time and energy for family and friends, vacations, hobbies, learning of languages, you name it! At Celerity, our focus is on bringing synergies to business process, technology, and creative chops to deliver inspirational solutions. We love giving kudos and celebrating achievements. We provide an inclusive and flexible environment that fosters collaboration, quick pivots, and quality work.  Originally founded in 2002, Celerity was acquired in 2015 by AUSY, a France-based IT consultancy and engineering services firm. With 8 regional offices and 400+ employees, Celerity’s clients span a variety of industries including media and entertainment, healthcare, financial services, manufacturing, non-profit, and hospitality.  You can also get to know us better on Instagram, Facebook, Dribbble, LinkedIn, and Twitter.  Build Your Career  At Celerity, our professional growth plan focuses on individualizing everyone’s track through incorporating personal interests with training and mentorship, while further expanding existing strong suits. We recognize in this ever-evolving digital world, there is extreme value in learning new skills and technologies. Celerity provides opportunities to shape skill sets, work with an amazing team of industry-leading consultants and exciting clients, in addition to professional trainings and certifications that best align with your career goals. We are proud to be an Equal Opportunity Employer. 
All employment decisions shall be made without regard to age, race, creed, color, religion, sex, national origin, ancestry, disability status, veteran status, sexual orientation, gender identity or expression, genetic information, marital status, citizenship status or any other basis as protected by federal, state, or local law. Celerity is committed to providing veteran employment opportunities to our service men and women.',
 "ABOUT FANDUEL GROUP  FanDuel Group is a world-class team of brands and products all built with one goal in mind — to give fans new and innovative ways to interact with their favorite games, sports, teams, and leagues. That's no easy task, which is why we're so dedicated to building a winning team. And make no mistake, we are here to win, but we believe in winning right. That means we'll never compromise when it comes to looking out for our teammates. From our many opportunities for professional development to our generous insurance and paid leave policies, we're committed to making sure our employees get as much out of FanDuel as we ask them to give.  FanDuel Group is based in New York, with offices in California, New Jersey, Florida, Oregon and Scotland. Our brands include: FanDuel — A game-changing real-money fantasy sports app FanDuel Sportsbook — America's #1 sports betting app TVG — The best-in-class horse racing TV/media network and betting platform FanDuel Racing — A horse racing app built for the average sports fan FanDuel Casino & Betfair Casino — Fan-favorite online casino apps FOXBet — A world-class betting platform PokerStars — The premier online poker product THE POSITION Our roster has an opening with your name on it  We are looking for a Reporting Data Analyst to join our growing Analytics team working with all aspects of regulatory and compliance reporting. You will work with Engineering, Compliance and regulatory stakeholders to ensure that reports produce accurate, timely results.  
THE GAME PLAN Everyone on our team has a part to play Work with internal and external stakeholders to specify requirements for regulatory and compliance reports Use your experience with SQL to design and develop reports in a number of databases Collaborate with Engineering to develop databases in support of current and future reporting needs Take ownership of the schedule of reports, investigating and remediating any issues related to accuracy or availability Identifying and acting on opportunities to further improve regulatory and compliance reporting Supporting internal stakeholders in Compliance, Legal and other departments with data analysis THE STATS What we're looking for in our next teammate Highly numerate Bachelor's degree 1-2 years' experience working with data, reporting and analysis Strong technical and analytic skills – SQL and Python are essential, Excel is nice to have Experience using automation and visualization tools to deliver information to a range of stakeholders Experience of analyzing and manipulating large data sets across multiple data sources Desire to learn new technical skills and practice continuous personal development THE CONTRACT We treat our team right  Competitive compensation is just the beginning. As part of our team, you can expect: An exciting and fun environment committed to driving real growth Opportunities to build really cool products that fans love Mentorship and professional development resources to help you refine your game Flexible vacation allowance to let you refuel Hall of Fame benefit programs and platforms FanDuel Group is an equal opportunities employer. Diversity and inclusion in FanDuel means that we respect and value everyone as individuals. We don't tolerate bias, judgement or harassment. Our focus is on developing employees so that they reach their full potential."]
In [ ]:
# fitting the Top2Vec model using USE embeddings on the raw job description text
# top2vec_raw = Top2Vec(docs_raw, 
#                       speed = 'deep-learn', 
#                       embedding_model = 'universal-sentence-encoder')

# saving raw Top2Vec model
# top2vec_raw.save(root_path + "/data/top2vec_raw")
2021-03-14 20:40:50,740 - top2vec - INFO - Pre-processing documents for training
2021-03-14 20:41:07,828 - top2vec - INFO - Downloading universal-sentence-encoder model
2021-03-14 20:41:25,304 - top2vec - INFO - Creating joint document/word embedding
2021-03-14 20:41:56,011 - top2vec - INFO - Creating lower dimension embedding of documents
2021-03-14 20:42:33,999 - top2vec - INFO - Finding dense areas of documents
2021-03-14 20:42:34,505 - top2vec - INFO - Finding topics
In [ ]:
# loading raw Top2Vec model
top2vec_raw = Top2Vec.load(root_path + "/data/top2vec_raw")
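Training with `speed = 'deep-learn'` is slow, which is why the fitting cell above was run once, saved, and is now commented out in favor of `Top2Vec.load`. That train-once-then-load pattern can be captured in a small generic helper (a sketch; `load_or_train` and its callable arguments are illustrative, not part of the Top2Vec API):

```python
import os

def load_or_train(path, train_fn, load_fn, save_fn):
    """Return a cached model from `path` if it exists;
    otherwise train one, save it to `path`, and return it."""
    if os.path.exists(path):
        return load_fn(path)
    model = train_fn()
    save_fn(model, path)
    return model
```

With Top2Vec this would be called roughly as `load_or_train(root_path + "/data/top2vec_raw", lambda: Top2Vec(docs_raw, speed='deep-learn', embedding_model='universal-sentence-encoder'), Top2Vec.load, lambda m, p: m.save(p))`, so a fresh runtime never retrains when a saved model is available.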
In [ ]:
# getting number of topics
top2vec_raw.get_num_topics()
Out[ ]:
48
In [ ]:
# getting top 3 topics
topic_words_raw, word_scores_raw, topic_nums_raw = top2vec_raw.get_topics(3)
In [ ]:
# plotting word clouds for the top 3 topics
for topic in topic_nums_raw:
  top2vec_raw.generate_topic_wordcloud(topic)
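When a plotting backend is unavailable (e.g., a headless runtime), the same information can be summarized textually from the arrays returned by `get_topics`. The values below are illustrative stand-ins, not the model's actual output:

```python
# illustrative stand-ins for the arrays returned by get_topics(3)
topic_words_raw = [["data", "sql", "python"],
                   ["health", "clinical", "hedis"],
                   ["travel", "app", "mobile"]]
word_scores_raw = [[0.31, 0.27, 0.25],
                   [0.29, 0.24, 0.22],
                   [0.28, 0.26, 0.21]]
topic_nums_raw = [0, 1, 2]

def topic_summary(words, scores, num, top_n=3):
    """Format the top-n words of one topic as 'word(score)' pairs."""
    pairs = [f"{w}({s:.2f})" for w, s in zip(words[:top_n], scores[:top_n])]
    return f"topic {num}: " + ", ".join(pairs)

for words, scores, num in zip(topic_words_raw, word_scores_raw, topic_nums_raw):
    print(topic_summary(words, scores, num))
```

This prints one line per topic (e.g., `topic 0: data(0.31), sql(0.27), python(0.25)` for the illustrative values above), which is often enough for a quick sanity check before generating word clouds.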
Data Scientist
In [ ]:
# parsing raw data scientist text
ds_raw = list(data[data['JobType'] == 'Data Scientist']['JobDescription'].values)
ds_raw[:5]
Out[ ]:
['ABOUT HOPPER  At Hopper, we’re on a mission to make booking travel faster, easier, and more transparent. We are leveraging the power that comes from combining massive amounts of data and machine learning to build the world’s fastest-growing travel app -- one that enables our customers to save money and travel more. With over $235M CAD in funding from leading investors in both Canada and the US, Hopper is primed to continue its path toward becoming the go-to way to book travel as the world continues its shift to mobile.  Recognized as the fastest-growing travel app by Forbes and one of the world’s most innovative companies by Fast Company two years in a row, Hopper has been downloaded over 40 million times and has helped travelers plan over 100 million trips and counting. The app has received high praise in the form of mobile accolades such as the Webby Award for Best Travel App of 2019, the Google Play Award for Standout Startup of 2016 and Apple’s App Store Best of 2015.  Take off with us!  THE ROLE  Hopper is looking for a data-savvy individual to join our team as a Data Scientist and lead data-centric product development and complex business intelligence projects within our core air travel business unit. Every day you would draw powerful insights from our real-time feed of billions of flight search results and archives of several trillion data points. To succeed at Hopper you need the talent, passion, and experience to thrive in a highly performing company. IN THIS ROLE YOU WILL: Frame and conduct complex exploratory analyses needed to deepen our understanding of Hopper users. 
Partner with product, business and strategy teams to leverage this user understanding for product improvements and other initiatives Use machine learning and big data tools on tremendously large and complex data sets to enhance our data-driven, personalized travel advice Conduct research into various aspects of our business and employ statistical and modeling techniques when appropriate to make recommendations to non-technical stakeholders Create advanced dashboards for product experiment tracking and business unit performance analysis using Amplitude and Tableau Find effective ways to simplify and communicate analyses to a non-technical audience. A PERFECT CANDIDATE HAS: A degree in Math, Statistics, Computer Science, Engineering or other quantitative disciplines Extremely strong analytical and problem-solving skills Proven ability to communicate complex technical work to a non-technical audience A strong passion for and extensive experience in conducting empirical research and answering hard questions with data Experience with a data visualization tool (Tableau preferred) and project analysis tool such as Amplitude Experience with relational databases and SQL, especially Hive Experience working with extremely large data sets Experience in Pandas, R, SAS or other tools appropriate for large scale data preparation and analysis Experience with data mining, machine learning, statistical modeling tools and underlying algorithms Proficiency with Unix/Linux environments BENEFITS  • Well-funded and proven startup with large ambitions, competitive salary and stock options • Dynamic and entrepreneurial team where pushing limits is everyday business • 100% employer paid medical, dental, vision, disability and life insurance plans • Access to a 401k (US) or Retirement Savings Plan (Canada)',
 'At Noom, we use scientifically proven methods to help our users create healthier lifestyles, and manage important conditions like Type-II Diabetes, Obesity, and Hypertension. Our Engineering team is at the forefront of this challenge, solving complex technical and UX problems on our mobile apps that center around habits, behavior, and lifestyle.  We are looking for a Data Scientist to join our Data team and help us ensure that we apply the best approaches to data analysis and research, artificial intelligence, and machine learning.  What You\'ll Like About Us: We work on problems that affect the lives of real people. Our users depend on us to make positive changes to their health and their lives. We base our work on scientifically-proven, peer-reviewed methodologies that are designed by medical professionals. We are a data-driven company through and through. We\'re a respectful, diverse, and dynamic environment in which Engineering is a first-class citizen, and where you\'ll be able to work on a variety of interesting problems that affect the lives of real people. We offer a generous budget for personal development expenses like training courses, conferences, and books. You\'ll get three weeks\' paid vacation and a flexible work policy that is remote- and family-friendly (about 50% of our engineering team is fully remote). We worry about results, not time spent in seats. What We\'ll Like About You: You have 4+ years of experience as a Data Scientist or Data Analyst in a similarly-sized organization, with a proven record of analysis and research that positively impacts your team. You possess excellent communication skills and the ability to clearly communicate technical concepts to a non-technical audience You possess excellent SQL/relational algebra skills, ideally with at least a basic knowledge of how different types of databases (e.g.: column vs row storage) work. 
You have a superior knowledge of statistical analysis methods, such as input selection, logistic and standard regression, etc. You are comfortable writing Python code, and have good working knowledge of pandas and numpy. We don\'t expect you to write production-quality code, but you should have some programming experience. You are comfortable with at least "medium data" technologies and how to transcend the "memory bound" nature of most analytics tools.',
 "Decode_M  https://www.decode-m.com/  Data Science Manager : Job Description  We’re hiring a Data Science Manager in our New York office to lead Decode_M’s Data Science team in developing new solutions for our clients and advancing the science of momentum through our proprietary product.  This position, based in New York City, requires an interest in managing Decode_M’s data science team, workflow and offerings. It demands excellent organizational skills, a natural curiosity, an eagerness to dive into the deep end, and a hunger to learn and grow while having a blast. Expect to learn a lot about cool companies, new products, and the latest in analytics, research and strategy.  About You You have an advanced degree in data science, mathematics, or another highly quantitative field, and are comfortable understanding and utilizing recent academic research in your work. You have 3+ years of relevant work experience using applied statistics and/or machine learning. Bonus points if you’ve worked in a consulting, or other client-facing role before You have 1+ years of management experience. You are fluent in Python and R, as well as a robust numerical computing and data science stack in each language. Not only do you produce readable and well-documented code, but elevate your team’s work through thoughtful code review and mentorship You are highly consultative, engaged, and detail-oriented. You see beyond the question you are asked to understand client objectives, and chart a course for long-term success in our relationships You are an autodidact, who loves to figure out new solutions to problems by diving deep into StackOverflow, Arxiv.org, or your preferred destination. You are curious, inquisitive, proactive, organized, and methodical. You like to use substantiated data to both identify and solve problems, rather than trust your gut alone   What You'll Do  Lead a team. Help to set the vision for the Data Science team and follow through on that vision. 
Be a mentor to a small, close-knit team of data scientists, helping them to develop professionally  Drive New Product Development: Decode_M has developed a proprietary algorithm to quantify the cultural momentum of brands, entities, people and movements. You will steward the development of the algorithm and be part of the product team to bring it to market with new customers  Manage Process. Work across teams and with Decode_M’s leadership to ensure that our work is top notch, deadlines are met, and projects are resourced appropriately. Foster an environment where the team is empowered to do its best work  Build Tools. Use data from social media, online reviews, mainstream media sites, search, web traffic, quantitative research, CRMs or other databases, etc. to uncover intelligence around growth opportunities and business challenges. Build software and/or repeatable processes to scale this work  Grow our Capabilities. Advance our expertise in applying statistics and machine learning to digital trends, CRM data, and quantitative research. Know the state of the field well enough to know when to reach for best-in-class tools and when to design your own solutions",
 "Sapphire Digital seeks a dynamic and driven mid-level Data Analyst/QA to join our growing New Jersey team with experience in the healthcare domain. This role will perform in-depth analysis of our internal data management systems to identify, analyze, and interpret trends and/or patterns in order to provide actionable recommendations for data integrity. The Data Analyst/QA will work closely with multiple internal teams to collaborate on different stages of the data lifecycle and assist in developing data quality processes. In this position, you'll be responsible for: Preparing and conducting analyses and studies, needs assessment, and requirements analysis to align systems and solutions Overseing data QA functions to ensure data integrity and accuracy Monitoring all production data processes and validation systems Performing data investigations, root cause analysis, data profiling, and data lineage activities Performing research and provides recommendations for data processes and updates Reviewing and analyzes specific data elements for accuracy and quality assurance Driving the resolution of the identified data quality problems Maintaining documentation as required Partnering with BI in the development of reports and presentations Maintaining data quality rules, data standards and data governance policies Supporting special project work as assigned by the data managers Collaborating effectively with cross functional teams including Data, Business Intelligence, Business Applications, and Engineering  You might be a good fit if you have: Bachelor's degree (B. A. / B. S.) 
from four-year college or university; 4+ years related experience and/or training; or equivalent combination of education and experience 5+ years of experience as a data analyst or in a related field Expertise with query languages (SQL) required Experience working with large complex datasets including healthcare data such as medical claims, clinical information, demographic data and program activity results Exposure with one or more of the following programming languages: SAS, Python, R Prior experience working with Healthcare data, or in the Healthcare field preferred Experience with data visualization tools and methodologies (Excel and PowerBI) Demonstrated experience in analyzing large data sets and relational databases Ability to manage time and priorities of multiple projects/tasks Displays professionalism and ability to learn quickly Extremely detail oriented Strong oral and written communication skills Ability to collaborate closely with multiple teams across with a wide range of technical backgrounds",
 'Director, Data Science - (200537) Description Edelman Intelligence is seeking a Director-level Data Scientist. The person in this role is will work to help drive our predictive analytics business by building out the Big Data / machine learning / AI capabilities of the Predictive Intelligence capabilities. This person will contribute significantly towards client deliverables, help develop new analytical approaches and offerings, and mentor more junior team members.  Responsibilities:  • Assist in the leadership of day to day project execution, including client deliverables, status updates, and project management • Efficiently manage data from disparate sources, distilling into datasets prepared for data science • Execute statistical models to support client projects • Prepare client facing material (example: PowerPoint slides and charts), distilling analytical insights effectively into stories for clients • Develop standardized code and processes that can be easily used by the larger team • Support senior staff in the development of client proposals • Mentor junior analytics team members and provide training on analytical offerings throughout the organization Qualifications: • Academic background in data science, including big data, machine learning, and AI • Comfortable with prepping and visualizing data • Experience in related programming languages, e.g. Python, Keras, Tensorflow, Pytorch, Scrapy, R, – experience in more than one language is a preferred • Excellent organizational and communication skills, coupled with the ability to adapt to new conditions, assignments and deadlines • Knowledge of marketing, market research, and economics About Us  Edelman is a global communications firm that partners with businesses and organizations to evolve, promote and protect their brands and reputations. 
Our 6,000 people in more than 60 offices deliver communications strategies that give our clients the confidence to lead and act with certainty, earning the trust of their stakeholders. Our honors include the Cannes Lions Grand Prix for PR; Advertising Age’s 2019 A-List; the Holmes Report’s 2018 Global Digital Agency of the Year; and, five times, Glassdoor’s Best Places to Work. Since our founding in 1952, we have remained an independent, family-run business. Edelman owns specialty companies Edelman Intelligence (research) and United Entertainment Group (entertainment, sports, lifestyle).  Click here to view a short video about life at Edelman.  Edelman is an equal opportunity employer of all protected classes, including veterans and individuals with disabilities. Job : Research and Analytics Primary Location : United States-New York Job Type : Experienced Schedule : Full-time Job Posting : Jul 13, 2020, 11:30:21 AM']
In [ ]:
# fitting the Top2Vec model using USE embeddings on the raw data scientist text
# (commented out after the initial run; the fitted model is loaded from disk below)
# top2vec_ds_raw = Top2Vec(ds_raw, 
#                          speed = 'deep-learn', 
#                          embedding_model = 'universal-sentence-encoder')

# saving raw Top2Vec model
# top2vec_ds_raw.save(root_path + "/data/top2vec_ds_raw")
2021-03-14 20:49:05,511 - top2vec - INFO - Pre-processing documents for training
2021-03-14 20:49:12,489 - top2vec - INFO - Downloading universal-sentence-encoder model
2021-03-14 20:49:30,376 - top2vec - INFO - Creating joint document/word embedding
2021-03-14 20:49:42,537 - top2vec - INFO - Creating lower dimension embedding of documents
2021-03-14 20:50:18,319 - top2vec - INFO - Finding dense areas of documents
2021-03-14 20:50:18,477 - top2vec - INFO - Finding topics
In [ ]:
# loading raw Top2Vec model
top2vec_ds_raw = Top2Vec.load(root_path + "/data/top2vec_ds_raw")
In [ ]:
# getting number of topics
top2vec_ds_raw.get_num_topics()
Out[ ]:
14
In [ ]:
# getting the top 3 topics
topic_words_ds_raw, word_scores_ds_raw, topic_nums_ds_raw = top2vec_ds_raw.get_topics(3)
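`get_topics` returns three parallel arrays (topic words, word scores, and topic numbers) that line up by index. A minimal pure-Python sketch of reading them out; the toy arrays below are illustrative stand-ins, not the model's actual output:

```python
def format_topics(topic_words, word_scores, topic_nums):
    """Render each topic as a 'topic N: word (score), ...' line."""
    lines = []
    for words, scores, num in zip(topic_words, word_scores, topic_nums):
        pairs = ", ".join(f"{w} ({s:.2f})" for w, s in zip(words, scores))
        lines.append(f"topic {num}: {pairs}")
    return lines

# toy stand-ins for the arrays returned by get_topics(2)
topic_words = [["python", "learning", "models"], ["sql", "reporting", "dashboards"]]
word_scores = [[0.31, 0.29, 0.25], [0.40, 0.33, 0.30]]
topic_nums = [0, 1]

for line in format_topics(topic_words, word_scores, topic_nums):
    print(line)
```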
In [ ]:
# plotting word clouds for the top 3 topics
for topic in topic_nums_ds_raw:
  top2vec_ds_raw.generate_topic_wordcloud(topic)
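As a quick sanity check on the word clouds, simple keyword counts over the raw descriptions give a rough skill profile for a field. A minimal stdlib sketch; the `SKILLS` list and sample strings are illustrative, not taken from the dataset:

```python
from collections import Counter
import re

SKILLS = ["sql", "python", "r", "excel", "tableau"]

def skill_counts(descriptions):
    """Count how many descriptions mention each skill (whole-token match)."""
    counts = Counter()
    for text in descriptions:
        tokens = set(re.findall(r"[a-z+#]+", text.lower()))
        for skill in SKILLS:
            if skill in tokens:
                counts[skill] += 1
    return counts

sample = ["Expertise with SQL required; exposure to Python or R",
          "Experience with Excel and PowerBI"]
print(sorted(skill_counts(sample).items()))
# [('excel', 1), ('python', 1), ('r', 1), ('sql', 1)]
```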
Data Analyst
In [ ]:
# parsing raw data analyst text
da_raw = list(data[data['JobType'] == 'Data Analyst']['JobDescription'].values)
da_raw[:5]
Out[ ]:
["Are you eager to roll up your sleeves and harness data to drive policy change? Do you enjoy sifting through complex datasets to illuminate trends and insights? Do you see yourself working for a values-driven organization with a vision to tackle the most pressing injustices of our day?  We are looking to hire a bright, hard-working, and creative individual with strong data management skills and a demonstrated commitment to immigrant's rights. The Data Analyst will assist with analysis and reporting needs for Veras Center on Immigration and Justice (CIJ), working across its current projects and future Vera initiatives.  Who we are:  Founded in 1961, The Vera Institute is an independent, non-partisan, nonprofit organization that combines expertise in research, technical assistance, and demonstration projects to assist leaders in government and civil society examine justice policy and practice, and improve the systems people rely on for justice and safety. We study problems that impede human dignity and justice. We pilot solutions that are at once transformative and achievable. We engage diverse communities in informed debate. And we harness the power of evidence to drive effective policy and practice What were doing:  We are helping to build a movementamong government leaders, advocates, and the immigration legal services communitytowards universal legal representation for immigrants facing deportation. In the face of stepped-up immigration enforcement, millions of non-citizens are at risk of extended detention and permanent separation from their families and communities. Veras Center on Immigration and Justice (CIJ) partners with government, non-profit partners, and communities to improve government systems that affect immigrants and their families. CIJ administers several nationwide legal services programs for immigrants facing deportation, develops and implements pilot programs, provides technical assistance, and conducts independent research and evaluation.  
Thats where you come in: The Data Analyst will support the Centers programmatic efforts through regular monitoring and reporting of federal government and subcontractor data. CIJ manages several proprietary databases that run on AWS and Caspio and uses SQL, R, and Python to manage data. This is an opportunity to help shape an innovative national research and policy agenda as part of a dedicated team of experts working to improve access to justice for non-citizens.  Vera seeks to hire a Data Analyst to work on various data management projects with its Center on Immigration and Justice (CIJ). In collaboration with other Data Analysts, this position will involve work across several projects, such as the Unaccompanied Childrens Program (UCP), a program to increase legal representation for immigrant children facing deportation without a parent or legal guardian. The position may cover additional duties for the Legal Orientation Program for Custodians (LOPC), which educates the custodians of unaccompanied children about their rights and the immigration court process.  About the role:  As a Data Analyst, you will report to a member of the research team and work in close collaboration with other Vera staff on ongoing database management, monitoring, reporting, and analysis projects. Youll support the team by taking ownership of ongoing monitoring and reporting tasks involving large data sets. 
Other principal responsibilities will include: Supporting research staff by preparing large datasets for analysis, including merging, cleaning, and recoding data; Providing insights into program performance through summary statistics and performance indicators; Producing timely reports on Vera projects for team members and stakeholders; Improving recurring reporting processes by optimizing code and producing subsequent documentation; Coordinating database management tasks such as participating in new database design, modifying existing databases, and communicating with outside engineers and subcontractors; Developing codebooks and delivering user trainings through webinars and database guides; Building and maintaining interactive dashboards; Documenting and correcting data quality issues; Working with supervisors to prioritize program needs; Assisting on other projects and tasks as assigned. About you:  Youre committed to improving issues affecting immigrants in the United States. Applicants with personal experiences with the immigration system are especially encouraged to apply.  Youre just getting started in your career and have 1 2 years of professional or internship experience working with large datasets and preparing data for analysis.  You have a real enthusiasm for working with data.  You are comfortable writing queries in SQL, R, and/or Python, or have a solid foundation coding in other programming languages used to manipulate data. Experience working collaboratively using tools like Git/GitHub is a plus.  You have exceptional attention to detail, strong problem-solving ability and logical reasoning skills, and the ability to detect anomalies in data.  Youre able to work on multiple projects effectively and efficiently, both independently and collaboratively with a team.  This position involves working with secure data that may require government security clearance. That clearance is restricted to U.S. 
citizens and citizens of countries that are party to collective defense agreements with the U.S. The list of those countries is detailed on this webpage. An additional requirement of that clearance is residence in the United States for at least three of the last five years.  How to apply:  Please submit cover letter and resume. Applications will be considered on a rolling basis until position is filled. Online submission in PDF format is preferred. Applications with no cover letter attached will not be considered. The cover letter should address your interest in CIJ and this position.  However, if necessary, materials may be mailed or faxed to  ATTN: Human Resources / CIJ Data Analyst Recruitment  Vera Institute of Justice  34 35th St, Suite 4-2A  Brooklyn, NY 11232  Fax: (212) 941-9407  Please use only one method (online, mail or fax) of submission.  No phone calls, please. Only applicants selected for interviews will be contacted.  Vera is an equal opportunity/affirmative action employer. All qualified applicants will be considered for employment without unlawful discrimination based on race, color, creed, national origin, sex, age, disability, marital status, sexual orientation, military status, prior record of arrest or conviction, citizenship status, current employment status, or caregiver status.  Vera works to advance justice, particularly racial justice, in an increasingly multicultural country and globally connected world. We value diverse experiences, including with regard to educational background and justice system contact, and depend on a diverse staff to carry out our mission.  For more information about Vera and CIJs work, please visit www.vera.org.  Powered by JazzHR",
 'Overview  Provides analytical and technical support for the integration of multiple data sources used to prepare internal and external reporting for the Quality Management team and business stakeholders. Provides support and analytical insight for Quality Incentive measures, HEDIS measures, and Quality Improvement initiatives. Monitors, analyzes, and communicates Quality performance related to benchmarks. Collaborates with clinical and operational teams within Quality Management, as well as with CHOICE Clinical Operations and Business Intelligence & Analytics (BIA). Participates in data validation of current reporting and dashboards. Monitors data integrity of databases and provides recommendation accordingly. Participates in the development of internal dashboards and databases. Works under general direction.  Responsibilities Provides support and analytical insight for Quality Incentive measures, HEDIS measures, and Quality Improvement initiatives. Monitors internal performance against benchmarks through analysis. Participates in the identification, development, management, and monitoring of quality improvement initiatives. Collaborates with Education staff and makes recommendations for areas of focus in training of assessors and care managers, based on analysis of performance trends. Researches and identifies technical/operational problems surrounding systems/applications; communicates/refers complex and unresolved problems to management, Business Intelligence & Analytics (BIA), and/or IT. Conducts ad hoc analyses to help identify operational gaps in care; drafts presentations, reports, publications, etc. regarding results of analyses. Communicates results of data analysis to non-technical audiences. Participates in prioritization of departmental goals based on identification of operational gaps in care. Participates in establishing data quality specifications and designs. 
Coordinates and supports integrated data systems for analyzing and validating information. Identifies and makes recommendations for reporting re-designs and platforms for reporting (e.g. automating a manual Excel file using macros, developing a MicroStrategy dashboard to replace manually updated Excel dashboards, moving data storage from Excel to Access, etc.), as needed. Trains staff on use of new/updated systems and related topics. Assists Quality management team with database and department reports. Conducts operations review and analysis of processes and procedures, issues report of findings and implements approved changes as required. Identifies and recommends software needs and applications to accomplish required reporting. Retrieves, compiles, reviews and ensures accuracy of data from databases; researches and corrects discrepancies, as needed. Analyzes data from internal and external sources. Identifies and resolves data quality issues before reports are generated. Works with staff to correct data entry errors. Analyzes data, identifies trends, reoccurring problems, statistically significant findings and prepares reports/summaries for management review. Acts as a liaison between Quality Management, CHOICE Clinical Operations, and BIA. Reviews and identifies trends and variances in data and reports. Researches findings and determines appropriateness of elevating identified issues to leadership for further review/evaluation/action. Monitors and maintains files by ensuring that files are current and of relevant nature. Analyzes and corrects error reports to ensure timely and accurate data; develops corrective actions to prevent errors where possible. Participates in special projects and performs other duties, as needed. Qualifications Education: Bachelors degree in bio/statistics, epidemiology, mathematics, computer science, social sciences, a related field or the equivalent work experience required. 
Masters degree with concentration in computer science, data science, or statistics preferred.   Experience: Minimum of two years experience performing increasingly complex data analysis and interpretation, preferably in a managed care or health care setting, required. Experience with data extraction and manipulation required. Experience with relational databases and programming experience in SQL or PL/SQL required. Experience with claims data and health plan quality metrics (e.g., HEDIS, QARR) preferred. Proficiency conducting statistical analysis with R, SAS, Stata or other statistical software preferred. Advanced personal computer skills, including Microsoft Word, PowerPoint, Excel, and Access required. Effective oral, written communication and interpersonal skills required. Ability to multi task in a fast-paced environment required.  CA2020',
 'We’re looking for a Senior Data Analyst who has a love of mentorship, data visualization, and generating actionable insights from raw data. In this role, you’ll have the opportunity to be an organizational influencer, who will generate insights with a good degree of autonomy, and partner with data science to grow deeper analytical skills. You will be joining the Insights & Analytics team, a team tasked with developing insights and reporting to support our customers and advisors’ needs. This team sits within the Customer Operations team, but is also connected to the Product organization.  In this role, you will work mainly with Customer Operations stakeholders to set KPIs and evaluate the effectiveness of current strategies and workflows. You will be involved in many aspects of data operations, from data auditing to building dashboards and analytical insights. For example, you will review the code of more junior analysts and organize coding workshops. You will build metrics to evaluate the performance of our advisors, eliminating confounding variables and creating weighted measures that account for individual success. You will define metrics and create dashboards to track the success of the current strategic direction. You will analyze the text-based interactions with our advisors to improve care quality. You will also analyze interactions with our chatbot to improve the responses of the bot to match customer expectations. You will collaborate with the Data Engineering team in pursuit of better data modeling and you can collaborate with the Data Science team on more modeling-heavy pursuits, like topic modeling or search recommendations. You will be a mentor to all other members of the analytics team, which demands that you excel in a culture of teaching and learning.  You will report to the Manager of the Insights & Analytics team, Customer Operations in Squarespace’s New York offices.  
RESPONSIBILITIES Design and develop KPIs, reports, and dashboards using BI/data visualization tools that clearly track the effectiveness of current strategies Execute and communicate impactful analyses, using SQL and R/Python to improve the customer experience and drive Customer Operations and Product roadmap Raise the skill level of the analytics team through reviewing code, organizing training sessions, acting as a mentor, and introducing new analytical methods and tools Help design and evaluate experiments and pilots, setting clear hypotheses and success metrics, as well as using the right statistical methods for analyses Effectively partner with Customer Operations and Product stakeholders, as well as our Data Engineering team QUALIFICATIONS 4 or more years of experience as a data analyst, data scientist, or in a relevant field in which you have worked with large datasets to transform data to meaningful insights Mastery of SQL, and experience with R/Python Experience with dashboarding/BI tools like Chartio, Looker, Tableau Experience with data cleaning, analysis, and presentation to non-technical stakeholders Bachelor’s degree in a quantitative or logic-driven discipline Intellectual curiosity and a deep love of mentorship and generating actionable insights Preferred: Advanced degree (Master’s or PhD) in quantitative field Preferred: Experience with topic modeling, text analytics/natural language processing We are hiring at various experience levels and we’re particularly interested in having a diverse team with a broad set of skills and viewpoints. If this seems like an opportunity you’d like to explore, but you’re not sure if you qualify, we encourage you to apply anyway!  About Squarespace  Squarespace makes beautiful products to help people with creative ideas succeed. 
By blending elegant design and sophisticated engineering, we empower millions of people — from individuals and local artists to entrepreneurs shaping the world’s most iconic businesses — to share their stories with the world. Squarespace’s team of more than 900 is headquartered in downtown New York City, with offices in Dublin and Portland. For more information, visit www.squarespace.com/about.  Today, more than a million people around the globe use Squarespace to share different perspectives and experiences with the world. Not only do we embrace and celebrate the diversity of our customer base, but we also strive for the same in our employees. At Squarespace, we are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender, gender identity or expression, or veteran status. We are proud to be an equal opportunity workplace.',
 'Requisition NumberRR-0001939 Remote:Yes We collaborate. We create. We innovate.  Intrigued?  You’re a business professional with an innate curiosity that thrives in a dynamic and Agile environment. You appreciate teamwork, exemplify integrity, perseverance, flexibility, and a generosity of spirit… if this sounds like you, then please apply – we’d love to meet you!  Celerity is expanding and on the hunt for the savvy, creative, and analytically sound individuals that are motivated by solving complex problems. We’re in the business of transforming how people, process, and systems co-exist, while improving operational efficiencies and user-driven interactions. We work with groundbreaking companies, melding expertise in Digital Strategy, Technology, Creative, and Business Transformation. The health and safety of our employees is our top priority. Due to the pandemic, all our employees are working remotely and we will be conducting candidate interviews by video. This position will continue to be remote once the COVID-19 crisis has abated.  
What You’ll Do  • Perform data visualization and comparisons across multiple data sources / systems of record to provide value-added insights to support critical business operations teams • Build scripts to perform low-level analysis and data investigation working in conjunction with other Celerity or client team members to improve data accuracy and make more informed decisions • Support movement of large amounts of data and existing processes into new ecosystems including looking for opportunities for greater automation of existing processes via SQL • Produce customized reports and visualizations as well as other ad-hoc analysis in support of time-sensitive business requests • Demonstrate flexibility to expand beyond a pure data analysis role including upstream requirements gathering or downstream process recommendations/implementations centered around a strong understanding of client data  About You  • Bachelor’s Degree in Business, Management, or a quantitative field 3+ years SQL development of complex queries • 2+ years professional experience • Expertise designing and implementing ETL processes You have Teradata SQL Experience (preferred) • Experience with Data Visualization such as Tableau or PowerBI • You understand and have experience with multidimensional analysis and queries • Strong communication skills • You have an analytical mindset to be able to ask probing questions to eradicate ambiguity and are willing to point out flaws in logic and identify better ways to do things • You have “Good Data Intuition” • You are not an order taker, but an analyst that can sniff out trends, identify patterns, and answer open ended questions • Willingness to expand scope of work beyond data to include business analyst, process analyst, or project manager type activities as necessary We Are Celerity  Millions of people every day use websites, applications, and business processes designed and built by Celerity. 
As a consultancy of technologists, creatives, and business experts, we provide the action plans, teams, and solutions our clients need to transform the way they do business and deepen engagement with their customers.  Celerity empowers autonomy and accountability. We understand life happens and want our team to do what is needed to get the work done, enjoy it along the way, and have plenty of time and energy for family and friends, vacations, hobbies, learning of languages, you name it! At Celerity, our focus is on bringing synergies to business process, technology, and creative chops to deliver inspirational solutions. We love giving kudos and celebrating achievements. We provide an inclusive and flexible environment that fosters collaboration, quick pivots, and quality work.  Originally founded in 2002, Celerity was acquired in 2015 by AUSY, a France-based IT consultancy and engineering services firm. With 8 regional offices and 400+ employees, Celerity’s clients span a variety of industries including media and entertainment, healthcare, financial services, manufacturing, non-profit, and hospitality.  You can also get to know us better on Instagram, Facebook, Dribbble, LinkedIn, and Twitter.  Build Your Career  At Celerity, our professional growth plan focuses on individualizing everyone’s track through incorporating personal interests with training and mentorship, while further expanding existing strong suits. We recognize in this ever-evolving digital world, there is extreme value in learning new skills and technologies. Celerity provides opportunities to shape skill sets, work with an amazing team of industry-leading consultants and exciting clients, in addition to professional trainings and certifications that best align with your career goals. We are proud to be an Equal Opportunity Employer. 
All employment decisions shall be made without regard to age, race, creed, color, religion, sex, national origin, ancestry, disability status, veteran status, sexual orientation, gender identity or expression, genetic information, marital status, citizenship status or any other basis as protected by federal, state, or local law. Celerity is committed to providing veteran employment opportunities to our service men and women.',
 "ABOUT FANDUEL GROUP  FanDuel Group is a world-class team of brands and products all built with one goal in mind — to give fans new and innovative ways to interact with their favorite games, sports, teams, and leagues. That's no easy task, which is why we're so dedicated to building a winning team. And make no mistake, we are here to win, but we believe in winning right. That means we'll never compromise when it comes to looking out for our teammates. From our many opportunities for professional development to our generous insurance and paid leave policies, we're committed to making sure our employees get as much out of FanDuel as we ask them to give.  FanDuel Group is based in New York, with offices in California, New Jersey, Florida, Oregon and Scotland. Our brands include: FanDuel — A game-changing real-money fantasy sports app FanDuel Sportsbook — America's #1 sports betting app TVG — The best-in-class horse racing TV/media network and betting platform FanDuel Racing — A horse racing app built for the average sports fan FanDuel Casino & Betfair Casino — Fan-favorite online casino apps FOXBet — A world-class betting platform PokerStars — The premier online poker product THE POSITION Our roster has an opening with your name on it  We are looking for a Reporting Data Analyst to join our growing Analytics team working with all aspects of regulatory and compliance reporting. You will work with Engineering, Compliance and regulatory stakeholders to ensure that reports produce accurate, timely results.  
THE GAME PLAN Everyone on our team has a part to play Work with internal and external stakeholders to specify requirements for regulatory and compliance reports Use your experience with SQL to design and develop reports in a number of databases Collaborate with Engineering to develop databases in support of current and future reporting needs Take ownership of the schedule of reports, investigating and remediating any issues related to accuracy or availability Identifying and acting on opportunities to further improve regulatory and compliance reporting Supporting internal stakeholders in Compliance, Legal and other departments with data analysis THE STATS What we're looking for in our next teammate Highly numerate Bachelor's degree 1-2 years' experience working with data, reporting and analysis Strong technical and analytic skills – SQL and Python are essential, Excel is nice to have Experience using automation and visualization tools to deliver information to a range of stakeholders Experience of analyzing and manipulating large data sets across multiple data sources Desire to learn new technical skills and practice continuous personal development THE CONTRACT We treat our team right  Competitive compensation is just the beginning. As part of our team, you can expect: An exciting and fun environment committed to driving real growth Opportunities to build really cool products that fans love Mentorship and professional development resources to help you refine your game Flexible vacation allowance to let you refuel Hall of Fame benefit programs and platforms FanDuel Group is an equal opportunities employer. Diversity and inclusion in FanDuel means that we respect and value everyone as individuals. We don't tolerate bias, judgement or harassment. Our focus is on developing employees so that they reach their full potential."]
In [ ]:
# fitting the Top2Vec model using USE embeddings on the raw data analyst text
# (commented out after the initial run; the fitted model is loaded from disk below)
# top2vec_da_raw = Top2Vec(da_raw, 
#                          speed = 'deep-learn', 
#                          embedding_model = 'universal-sentence-encoder')

# saving raw Top2Vec model
# top2vec_da_raw.save(root_path + "/data/top2vec_da_raw")
2021-03-14 20:55:51,176 - top2vec - INFO - Pre-processing documents for training
2021-03-14 20:55:55,364 - top2vec - INFO - Downloading universal-sentence-encoder model
2021-03-14 20:56:15,486 - top2vec - INFO - Creating joint document/word embedding
2021-03-14 20:56:22,402 - top2vec - INFO - Creating lower dimension embedding of documents
2021-03-14 20:56:41,955 - top2vec - INFO - Finding dense areas of documents
2021-03-14 20:56:42,054 - top2vec - INFO - Finding topics
In [ ]:
# loading raw Top2Vec model
top2vec_da_raw = Top2Vec.load(root_path + "/data/top2vec_da_raw")
In [ ]:
# getting number of topics
top2vec_da_raw.get_num_topics()
Out[ ]:
2
In [ ]:
# getting the top 2 topics
topic_words_da_raw, word_scores_da_raw, topic_nums_da_raw = top2vec_da_raw.get_topics(2)
In [ ]:
# plotting word clouds for the top 2 topics
for topic in topic_nums_da_raw:
  top2vec_da_raw.generate_topic_wordcloud(topic)
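The Top2Vec runs above suggest the data scientist corpus is more topically diverse than the data analyst one (14 topics vs. 2). A crude complementary check is vocabulary overlap between fields; a minimal stdlib sketch using Jaccard similarity over word sets (the sample strings are illustrative, not drawn from the dataset):

```python
import re

def vocab(texts):
    """Union of lowercase word tokens across a list of documents."""
    words = set()
    for text in texts:
        words.update(re.findall(r"[a-z]+", text.lower()))
    return words

def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

ds_vocab = vocab(["machine learning models in Python"])
da_vocab = vocab(["dashboards and reporting in Python"])
print(round(jaccard(ds_vocab, da_vocab), 3))  # 2 shared words of 8 total -> 0.25
```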
Business Analyst
In [ ]:
# parsing raw business analyst text
ba_raw = list(data[data['JobType'] == 'Business Analyst']['JobDescription'].values)
ba_raw[:5]
Out[ ]:
["Company Overview   At Memorial Sloan Kettering (MSK), we’re not only changing the way we treat cancer, but also the way the world thinks about it. By working together and pushing forward with innovation and discovery, we’re driving excellence and improving outcomes. For the 28th year, MSK has been named a top hospital for cancer by U.S. News & World Report. We are proud to be on Becker’s Healthcare list as one of the 150 Great Places to Work in Healthcare in 2018, as well as one of Glassdoor’s Employees’ Choice Best Place to Work for 2018. We’re treating cancer, one patient at a time. Join us and make a difference every day.  Job Description   We are excited to recruit two Business Analyst(s) to join our digital, informatics and technology organization who will measure success on our high priority platforms supported by the Division of Health Informatics. DHI focuses on the interaction of people, processes, and technology enabling MSK to meet critical patient care and research objectives. Our division manages a suite of clinical information systems including the electronic health record, ancillary systems and other core components that enable decision support, health information exchange and knowledge management.  About the Clinical and Logistics Platform:MSK has committed to transforming itself into a digital organization. To help advance the transformation, MSK has committed to the Scaled Agile Framework (SAFe) as a method of managing technology work. As part of our digital transformation, MSK has identified a set of platforms that it will be creating to advance its digital transformation.  One of these high priority platforms is the Clinical and Logistics Platform, which is part of MSK’s digital ecosystem. The focus of the Clinical and Logistics Platform is to enable our Care Teams and Health Care Providers with a pre-integrated platform comprising of all our clinical, logistics, and communication systems. 
We aim to provide a world-class user experience, reduce friction and burnout to our Care Teams, and empower the Care Team members to provide the best patient experience.  You will: Support agile team in translating end user requirements into set of user product criteria for technical development team & developing documentation Connect with team to confirm product commitments Conduct data analysis and research required for development team to make decisions Collaborate with team to understand, analyze, document, and communicate platform requirements Partner with key stakeholders to articulate and refine product vision throughout the iteration process Help design team in translating product needs/requirements into deliverable units of work Able to synthesize metrics such as: user satisfaction/adoption, quality of products delivered, time-to-market, timely delivery etc. Collaborates closely with agile teams to translate business requirements and operational metrics between user groups and developers Provide definitions for product criteria, readiness, completeness etc. You have: 2-4 years of experience in a healthcare setting in a role such as data/business analysis Bachelor's degree in Health Informatics, Computer Science, Information Systems or related SAFe certification or exposure to agile methodologies a strong plus Strong problem-solving and analytical skills, with ability to synthesize user needs and pull insights out of data Excellent communicator able to negotiate with business to clarify needs vs. 
wants/likes to balance with development cost and risk Ability to break down and describe business needs into work supporting an iterative and incremental delivery model Credibility with both business leaders, product designers, & developers #LI-POST  Benefits  Competitive compensation packages | Sick Time | Generous Vacation + 12 holidays to recharge & refuel | Internal Career Mobility & Performance Consulting | Medical, Dental, Vision, FSA & Dependent Care | 403b retirement savings plan match | Tuition Reimbursement | Parental Leave & Adoption Assistance | Commuter Spending Account | Fitness Discounts & Wellness Program | Resource Networks | Life Insurance & Disability | Remote Flexibility  We believe in communication, transparency, and thinking beyond your 8-hour day @ MSK. It’s important to us that you have a sense of impact, community, and work/life balance to be and feel your best.  Our Hiring Process  You read the ad, agree it sounds like a great fit & apply -> Talent Acquisition contacts you to schedule a phone interview (if your profile aligns)-> after speaking with the Talent Acquisition Specialist, you will connect with the Hiring Manager by phone or video call -> if your experience is a fit, you will move forward to an on-site visit or video call with the team -> post interview feedback -> ideally an offer! -> reference check & onboarding -> orientation & official welcome to MSK  MSK has been named a Best Place to Work in IT by Insider Pro & Computerworld, for the 2nd year in a row.  We look forward to meeting soon!  Closing   MSK is an equal opportunity and affirmative action employer committed to diversity and inclusion in all aspects of recruiting and employment. 
All qualified individuals are encouraged to apply and will receive consideration without regard to race, color, gender, gender identity or expression, sexual orientation, national origin, age, religion, creed, disability, veteran status or any other factor which cannot lawfully be used as a basis for an employment decision.  Federal law requires employers to provide reasonable accommodation to qualified individuals with disabilities. Please tell us if you require a reasonable accommodation to apply for a job or to perform your job. Examples of reasonable accommodation include making a change to the application process or work procedures, providing documents in an alternate format, using a sign language interpreter, or using specialized equipment.",
 'We are seeking for an energetic and collaborative analyst with experience and passion to fill our Business Analyst position. In this role you will be responsible for leading our Data & Analytics function. This is a cross-functional role where you will work with various areas (Investor Relations, Portfolio Excellence, Human Resource and Business Development) collecting and analyzing financial and operational data of our Private Equity and Portfolio Companies. If you enjoy working in a highly collaborative, analytical, fast-paced and dynamic environment, this is the right opportunity for you! The ideal candidate will be highly skilled in all aspects of data analytics, including mining, generation and visualization. The role is committed to transforming data into readable, goal-driven reports for investors and business leaders. Position Summary Essential Job Functions: To be the ‘data master’ for key IT platforms that collect financial, operating and contact management data across the portfolio of companies that Paine Schwartz owns and network of contacts Responsible for the timeliness and integrity of data in CRM (Salesforce.com) and finance/operating data systems (iLevel) and likely other data in the future Actively participate in innovative projects to improve the Data Quality or master Data Management within the Firm’s data ecosystem Responsible for assisting with the maintenance and updating of core firm marketing material, as well as supporting the Investor Relations & Business Development functions on specific presentation workstreams Will support the Human Capital, Investor Relations, Business Development and Portfolio Operations function with insightful and useful reporting and analysis from these systems Will have a career path in the exciting Private Equity industry with options to progress into more Financial or other functional responsibilities Candidate Requirement Required Skills: Excellent excel, database and PPT skills Understanding of income 
statements, balance sheets and cash flow reports and how they tie together Good people management skills, ability to follow-up and ensure data is received timely and correctly in a professional manner Extreme attention to detail Ability to take initiative and work independently, while demonstrating strong teamwork Must be able to work in a fast-paced environment, manage time and priorities under pressure and meet deadlines Specific Technology platform requirements: Salesforce.com (experience with this specific CRM tool is heavily desired): Create custom objects, design page layouts, adjust fields and create custom reports. Ability to use Salesforce data loader and Conga reporting solutions Understanding of Pardot or other email marketing software ILEVEL (we don’t require experience with this specific portal/finance tool but experience with ERP and financial platforms is desired). Ability to map data from financial statements. Qualifications: Undergraduate degree, preferably finance, economics, business or accounting related Minimum two years of experience in a professional business context involving data management, analysis and reporting',
 "For more than a decade, Asembia has been working with specialty pharmacies, manufacturers, prescribers, payers and other industry stakeholders to develop solutions for the high-touch specialty pharmaceutical service model.Through collaborative programs, contracting initiatives, patient support hub services and innovative technology platforms, Asembia is committed to positively impacting the patient journey.  Asembia focuses on the specialty pharmacy segment and offers comprehensive hub services, pharmacy network management, group purchasing (GPO) services, innovative technology platforms and more.As a leading industry voice and advocate, Asembia is committed to bringing strategic channel management solutions, leading-edge products and high-touch services to the specialty pharmacy industry that help our customers optimize patient care and outcomes.  Primary Function:  The Technical Analystis responsible for implementing data solutions for varied data reporting needs and act as a Subject Matter Expert (SME) for technical data product offerings.  Job Scope and Major Responsibilities:  ·Follow all policies and internal processes to ensure data programs and files are delivered in accordance with the provisions of the Health Insurance Portability and Accountability Act of 1996 and its implementing regulations, as amended (“HIPAA”).  ·Design, develop and validate data reporting solutions with minimum supervision  ·Evaluate data and report findings for accuracy, completeness, per the data specifications.  ·Responsible for monitoring file/data receipt per expected schedules. Identify/escalate/ resolve nonconformances.  ·Meet data reporting schedules to external stakeholders/ parties for designated data programs.  ·Seek and adopt best practices in data reporting, make recommendations for data processing improvements. Identify opportunities to improve the process and data accuracy.  ·Provide ad hoc data analysis to internal and external parties as requested.  
· Successful collaboration with other Asembia team members to attain a common goal.  Qualifications:  ·Bachelor’s Degree required. B.A./B.S. Information Systems or similar field preferred.  ·Minimum 1 to 3 years of experience working in data analytics or Software Quality assurance – preference towards Prescription / pharmaceutical data.  ·Must have a working knowledge of SQL.  ·Advance level skills in developing solutions using SQL and SSIS preferred, not required.  ·Must be detail oriented with the ability focus on complex processing steps.  ·Proven track record with taking ownership in data analytics or Software QA analysis.  ·Ability to manage multiple initiatives in a fast paced, dynamic environment.  ·Strong self-starter with a sense of ownership.  ·Excellent organizational and collaboration skills.  Asembia is committed to Equal Employment Opportunity (EEO) and to compliance with all Federal, State and local laws that prohibit employment discrimination on the basis of race, color, age, natural origin, ethnicity, religion, gender, pregnancy, marital status, sexual orientation, gender identity and expression, citizenship, genetic disposition, disability or veteran's status or any other classification protected by State/Federal laws.",
 'Job Description Summary The Information Security Analyst will be a member of the BD Security Operations group collaborating with Incident Response and other members to improve the security of BD. This person will further the adoption of the corporate Information Security framework within the Operations group, respond to alerts and conduct forensic analysis, in addition to project support. Job Description   The Information Security Analyst will be a member of the BD Security Operations group collaborating with Incident Response and other members to improve the security of BD. This person will further the adoption of the corporate Information Security framework within the Operations group, respond to alerts and conduct forensic analysis, in addition to project support.  Responsibilities: Ensure the response to security incidents, alerts and events Track and report operations monitoring and alerting Contribute to cross-functional collaboration of Operations initiatives, including working with Privacy, Legal, and Business LT on response actions Work with teams to ensure projects are meeting objectives and deadlines May perform other duties as required Primary Work Location USA NJ - Franklin Lakes Additional Locations Work Shift',
 "Magnite is the world's largest independent sell-side platform. We were built by combining Rubicon Project's programmatic expertise and Telaria's talents in CTV. We help publishers sell advertising on their terms and connect with buyers across every channel and format. We believe in keeping tech transparent, solutions collaborative, and guidance unconflicted. The ticker symbol for the newly formed company will be NASDAQ: MGNI.  This position will be responsible for supporting the - Global Sales Organization by generating actionable insights through analysis of financial, operational, and industry trends, and summarizing key items in reporting/dashboards for leadership.  What you'll be doing:  Operational and Financial Planning Revenue planning and analysis Monitor revenue and metrics on a daily/weekly/monthly basis and escalate potential concerns Create and maintain dashboards and visualizations of financial and operational data Perform business case modeling including assessing the impact to revenue/growth rates and profitability Provide recommendations to Sales to help achieving revenue objectives Financial forecasting including full P/L and/or contribution margin Measure business performance across multiple regions, product offerings, and categories Managing Ad hoc analysis requests from Sales Financial Analysis/Management Reporting Financial analysis and reporting of key revenue streams, including variance analysis of actuals versus budget/forecast/prior year Benchmark the Rubicon Project revenue performance against industry expectations and competition Maintain and enhance management reporting systems in order to manage, target, and track the performance of the Revenue/Product teams, including partnering with Accounting, Financial Systems, and Business Intelligence teams to define reporting requirements Development of tactical and strategic insights based on results of data analyses Ad hoc analysis and modeling on corporate initiatives What we are looking 
for: Minimum of 1-2 years of work experience, preferably in FP&A within a technology company, management consulting, investment banking, and/or operational finance roles Exceptional analytical skills and conceptual thinking ability Demonstrated success in managing large modeling efforts involving complex business problems, large data sets, and statistical analysis Experience with management reporting and creating executive summaries Strong oral and written communication skills Ability and desire to work in a fast-paced, fun, demanding environment Digital advertising/adtech experience a plus Undergraduate degree in Finance, Accounting, or quantitative related field with relevant financial work experience Software Expertise Strong proficiency in Excel with demonstrated ability of use of advanced functions such as pivot tables, lookup functions, and advanced charting techniques Proficient in Google Docs, along with PowerPoint, and Word Microsoft BI experience a plus MicroStrategy or Tableau experience a plus Experience building and using financial models Familiarity/capacity to learn database/query tools/techniques Experience with centralized planning/reporting tools a plus Perks/Benefits Career growth opportunities: We encourage you to carve your own path across the organization and provide opportunities to grow professionally Hungry?: Each Rubicon Project office offers free daily lunches daily and a fully stocked kitchen with healthy snacks. Take time for yourself: We offer an unlimited vacation policy and encourage you to refresh yourself as you need. We also close down the last two weeks of the year for a paid Holiday Break. 401k Match: We offer an unique 401K match program with a variety of tax break benefits Stay healthy: Choose from a variety of low cost medical, dental and vision plans to cover you and your loved ones with a multitude of options. 
In addition, we offer Basic Life and Disability Coverage provided at no cost to you Perks: Discounts to major name brand items, Travel benefit options, plus much more!"]
In [ ]:
# fitting the Top2Vec model using USE embeddings on the raw job description text
# top2vec_ba_raw = Top2Vec(ba_raw, 
#                          speed = 'deep-learn', 
#                          embedding_model = 'universal-sentence-encoder')

# saving raw Top2Vec model
# top2vec_ba_raw.save(root_path + "/data/top2vec_ba_raw")
2021-03-14 21:01:02,894 - top2vec - INFO - Pre-processing documents for training
2021-03-14 21:01:09,639 - top2vec - INFO - Downloading universal-sentence-encoder model
2021-03-14 21:01:27,331 - top2vec - INFO - Creating joint document/word embedding
2021-03-14 21:01:39,216 - top2vec - INFO - Creating lower dimension embedding of documents
2021-03-14 21:02:16,142 - top2vec - INFO - Finding dense areas of documents
2021-03-14 21:02:16,335 - top2vec - INFO - Finding topics
In [ ]:
# loading raw Top2Vec model
top2vec_ba_raw = Top2Vec.load(root_path + "/data/top2vec_ba_raw")
In [ ]:
# getting number of topics
top2vec_ba_raw.get_num_topics()
Out[ ]:
9
In [ ]:
# getting the top 3 topics and their word scores
topic_words_ba_raw, word_scores_ba_raw, topic_nums_ba_raw = top2vec_ba_raw.get_topics(3)
In [ ]:
# plotting word clouds for the top 3 topics
for topic in topic_nums_ba_raw:
  top2vec_ba_raw.generate_topic_wordcloud(topic)

Job Title - Neural Networks

In [ ]:
# importing TensorFlow and the Keras building blocks used below
# (importing consistently from tensorflow.keras to avoid mixing the standalone
# keras package with tf.keras, which can cause incompatibilities)
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras import layers
from tensorflow.keras.layers import Dropout
In [ ]:
# Define a helper that visualizes training history (accuracy and loss per epoch)
def plot_history(history):
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
In [ ]:
# Create copy dataframe for neural network, encoding and splitting data
nn_df = data[['JobTitle','JobDescription','JobType','SalaryAvg']]
X = nn_df.drop('JobType',axis=1)
y = nn_df['JobType']
encoder = LabelEncoder()
encoder.fit(y)
encoded_y = encoder.transform(y)
X_train_title, X_test_title, y_train_true, y_test_true = train_test_split(X['JobTitle'].values, encoded_y, test_size = 0.2, random_state = 42)
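To make the target encoding above concrete, here is a minimal sketch of what `LabelEncoder` does under the hood: it sorts the unique class labels and maps each label to its integer position. The label strings below are hypothetical stand-ins for the three `JobType` values, not the actual dataset labels.

```python
# Toy stand-in for sklearn's LabelEncoder: sort the unique labels and map
# each label to its position. The label strings are illustrative only.
labels = ["business analyst", "data analyst", "data scientist", "data analyst"]
classes = sorted(set(labels))                # alphabetical order of unique labels
mapping = {c: i for i, c in enumerate(classes)}
encoded = [mapping[l] for l in labels]       # [0, 1, 2, 1]
```

This is why the class indices 0, 1, 2 in the classification reports below correspond to the alphabetical order of the job types.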
In [ ]:
# One-hot encode the target variable (3 classes) for the sequential model
y_train = tf.keras.utils.to_categorical(y_train_true, 3)
y_test = tf.keras.utils.to_categorical(y_test_true, 3)
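The `to_categorical` call turns each integer label into a one-hot row vector. A minimal NumPy sketch of the same transformation (toy labels assumed):

```python
import numpy as np

def to_one_hot(labels, num_classes):
    # Equivalent of tf.keras.utils.to_categorical for integer labels:
    # row i gets a 1 in column labels[i] and 0 elsewhere.
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

one_hot = to_one_hot([0, 2, 1], 3)
```

This matches the 3-unit softmax output layer used below, which produces one probability per class.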
In [ ]:
# Build an embedding matrix from pretrained GloVe word vectors
# Only words present in our tokenizer vocabulary are kept; words missing
# from GloVe are left as all-zero rows
def create_embedding_matrix(filepath, word_index, embedding_dim):
    vocab_size = len(word_index) + 1  # Adding 1 because of reserved 0 index
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    with open(filepath) as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word] 
                embedding_matrix[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix
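To see what `create_embedding_matrix` produces, here is the same logic run on a tiny in-memory file in the GloVe text format (each line is a word followed by its vector components). The two words and their vectors are made up for illustration; index 0 stays all-zero because it is reserved for padding.

```python
import io
import numpy as np

# Hypothetical two-word GloVe-style file (word followed by its vector values).
glove_text = "data 0.1 0.2 0.3\nscience 0.4 0.5 0.6\n"
word_index = {"data": 1, "science": 2}  # index 0 reserved for padding

embedding_dim = 3
vocab_size = len(word_index) + 1
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for line in io.StringIO(glove_text):
    word, *vector = line.split()
    if word in word_index:
        embedding_matrix[word_index[word]] = np.array(vector, dtype=np.float32)[:embedding_dim]
```

Row 0 remains zeros (padding), and each vocabulary word's row holds its pretrained vector.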
In [ ]:
tokenizer = Tokenizer(num_words=50)
tokenizer.fit_on_texts(X_train_title)
X_train_title = tokenizer.texts_to_sequences(X_train_title)
X_test_title = tokenizer.texts_to_sequences(X_test_title)

maxlen_title = 50
X_train_title = pad_sequences(X_train_title, padding='post', maxlen=maxlen_title)
X_test_title = pad_sequences(X_test_title, padding='post', maxlen=maxlen_title)

embedding_dim = 100
embedding_matrix_title = create_embedding_matrix('/content/drive/Shareddrives/Data Mining Group Project/data/glove.6B.100d.txt',tokenizer.word_index, embedding_dim)

nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix_title, axis=1))
vocab_size_title = len(tokenizer.word_index) + 1
nonzero_elements / vocab_size_title
Out[ ]:
0.8872979214780601
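The nested `count_nonzero` above computes how many vocabulary words received a nonzero (i.e. pretrained) embedding row. A toy version of the same computation, with a made-up 4-row matrix:

```python
import numpy as np

# Toy embedding matrix: rows 1 and 3 were found in GloVe; rows 0 and 2 are all zeros.
m = np.array([[0.0, 0.0],
              [0.1, 0.2],
              [0.0, 0.0],
              [0.3, 0.4]])

# Inner count_nonzero: nonzero entries per row -> [0, 2, 0, 2]
# Outer count_nonzero: number of rows with any nonzero entry -> 2
rows_covered = np.count_nonzero(np.count_nonzero(m, axis=1))
coverage = rows_covered / m.shape[0]  # fraction of vocabulary with a pretrained vector
```

So the 0.887 above means roughly 89% of the job-title vocabulary has a pretrained GloVe vector.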
In [ ]:
model = Sequential()
model.add(layers.Embedding(vocab_size_title, embedding_dim,
                           weights=[embedding_matrix_title], 
                           input_length=maxlen_title, 
                           trainable=True))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(25, activation='relu'))
model.add(Dropout(0.2))
model.add(layers.Dense(3, activation='softmax'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 50, 100)           216500    
_________________________________________________________________
global_max_pooling1d (Global (None, 100)               0         
_________________________________________________________________
dense (Dense)                (None, 25)                2525      
_________________________________________________________________
dropout (Dropout)            (None, 25)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 3)                 78        
=================================================================
Total params: 219,103
Trainable params: 219,103
Non-trainable params: 0
_________________________________________________________________
In [ ]:
history = model.fit(X_train_title, y_train,
                    epochs=20,
                    verbose=False,
                    validation_data=(X_test_title, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train_title, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test_title, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)
Training Accuracy: 0.8753
Testing Accuracy:  0.8688
In [ ]:
# Get class predictions for the job-title model (argmax over softmax outputs)
result_train = model.predict(X_train_title)
pr_train = [np.argmax(x) for x in result_train]
result_test = model.predict(X_test_title)
pr_test = [np.argmax(x) for x in result_test]
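The argmax step above converts each row of softmax probabilities back into an integer class label, matching the `LabelEncoder` indices. A minimal illustration with made-up probability rows:

```python
import numpy as np

# Each row is a softmax output over the 3 job types; argmax recovers the
# predicted integer label. The probability values are illustrative only.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
preds = [int(np.argmax(p)) for p in probs]  # [0, 2]
```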
In [ ]:
print(classification_report(y_train_true, pr_train))
print(classification_report(y_test_true, pr_test))
              precision    recall  f1-score   support

           0       0.98      0.90      0.94      3258
           1       0.66      0.98      0.79      1788
           2       0.99      0.79      0.88      3157

    accuracy                           0.88      8203
   macro avg       0.87      0.89      0.87      8203
weighted avg       0.91      0.88      0.88      8203

              precision    recall  f1-score   support

           0       0.97      0.89      0.93       834
           1       0.66      0.98      0.79       465
           2       0.98      0.77      0.86       752

    accuracy                           0.87      2051
   macro avg       0.87      0.88      0.86      2051
weighted avg       0.90      0.87      0.87      2051

Job Description - Neural Networks

In [ ]:
# Reusing the same random_state keeps the split aligned with the earlier one,
# so the encoded target variables from above still match these rows
X_train, X_test = train_test_split(X['JobDescription'].values, test_size = 0.2, random_state = 42)
In [ ]:
# Tokenize words based on word values from the dictionary tokenizer.word_index
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
# The resulting sequences have varying lengths; pad each with trailing zeros to a fixed length
maxlen = 800
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
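The padding step can be sketched without Keras: truncate each sequence to `maxlen` (Keras drops tokens from the front by default, `truncating='pre'`) and right-pad shorter ones with zeros (`padding='post'`). The toy sequences below are made up.

```python
import numpy as np

def pad_post(seqs, maxlen):
    # Minimal equivalent of keras pad_sequences(padding='post'):
    # keep the last maxlen tokens (Keras default truncating='pre'),
    # then right-pad with zeros to a fixed length.
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for i, s in enumerate(seqs):
        s = s[-maxlen:]
        out[i, :len(s)] = s
    return out

padded = pad_post([[1, 2], [3, 4, 5, 6]], maxlen=3)  # [[1, 2, 0], [4, 5, 6]]
```

Fixed-length input is required because the Embedding layer below expects a rectangular integer matrix.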
In [ ]:
# Retrieve the embedding matrix
embedding_dim = 100
embedding_matrix = create_embedding_matrix('/content/drive/Shareddrives/Data Mining Group Project/data/glove.6B.100d.txt',tokenizer.word_index, embedding_dim)
In [ ]:
# Check how many embedding vectors are nonzero
nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1))
vocab_size = len(tokenizer.word_index) + 1
nonzero_elements / vocab_size
Out[ ]:
0.5916031452136485
In [ ]:
# Initialize sequential model with pretrained weights from GloVe
maxlen = 800
model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim,
                           weights=[embedding_matrix], 
                           input_length=maxlen, 
                           trainable=True))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(50, activation='relu'))
model.add(Dropout(0.2))
model.add(layers.Dense(25, activation='relu'))
model.add(Dropout(0.2))
model.add(layers.Dense(3, activation='softmax'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 800, 100)          4158700   
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 50)                5050      
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 25)                1275      
_________________________________________________________________
dropout_2 (Dropout)          (None, 25)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 3)                 78        
=================================================================
Total params: 4,165,103
Trainable params: 4,165,103
Non-trainable params: 0
_________________________________________________________________
In [ ]:
# Fit the model with epochs as 20 and plot history performance of the model
history = model.fit(X_train, y_train,
                    epochs=20,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)
Training Accuracy: 0.8522
Testing Accuracy:  0.7465
In [ ]:
# Get prediction from the model
result_train = model.predict(X_train)
pr_train = [np.argmax(x) for x in result_train]
result_test = model.predict(X_test)
pr_test = [np.argmax(x) for x in result_test]
In [ ]:
# Check classification results
print(classification_report(y_train_true, pr_train))
print(classification_report(y_test_true, pr_test))
              precision    recall  f1-score   support

           0       0.99      0.86      0.92      3258
           1       0.61      0.97      0.75      1788
           2       0.97      0.78      0.86      3157

    accuracy                           0.85      8203
   macro avg       0.86      0.87      0.84      8203
weighted avg       0.90      0.85      0.86      8203

              precision    recall  f1-score   support

           0       0.85      0.76      0.80       834
           1       0.51      0.78      0.61       465
           2       0.90      0.72      0.80       752

    accuracy                           0.75      2051
   macro avg       0.75      0.75      0.74      2051
weighted avg       0.79      0.75      0.76      2051

Job Title + Salary - Neural Networks

In [ ]:
# Define a history-plotting helper for the models with the salary feature
# (skips the first epoch so early transients don't dominate the plot)
def plot_history_salary(history):
    acc = history.history['accuracy'][1:]
    val_acc = history.history['val_accuracy'][1:]
    loss = history.history['loss'][1:]
    val_loss = history.history['val_loss'][1:]
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
In [ ]:
# With the same random state value, combine SalaryAvg to the model
X_train2, X_test2 = train_test_split(X['SalaryAvg'].values, test_size = 0.2, random_state = 42)
In [ ]:
input_1 = layers.Input(shape=(maxlen_title,))
input_2 = layers.Input(shape=(1,))
embedding_layer = layers.Embedding(vocab_size_title, embedding_dim,
                           weights=[embedding_matrix_title],  
                           trainable=True)(input_1)

globalpool = layers.GlobalMaxPool1D()(embedding_layer)

dense_layer_1 = layers.Dense(5, activation='relu')(input_2)
# dense_layer_2 = Dense(10, activation='relu')(dense_layer_1)
concat_layer = layers.Concatenate()([globalpool, dense_layer_1])
dense_layer_3 = layers.Dense(50, activation='relu')(concat_layer)
dropout_1 = layers.Dropout(0.2)(dense_layer_3)
dense_layer_4 = layers.Dense(25, activation='relu')(dropout_1)
dropout_2 = layers.Dropout(0.2)(dense_layer_4)
output = layers.Dense(3, activation='softmax')(dropout_2)
model = Model(inputs=[input_1, input_2], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 50)]         0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 50, 100)      216500      input_1[0][0]                    
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
global_max_pooling1d_2 (GlobalM (None, 100)          0           embedding_2[0][0]                
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 5)            10          input_2[0][0]                    
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 105)          0           global_max_pooling1d_2[0][0]     
                                                                 dense_5[0][0]                    
__________________________________________________________________________________________________
dense_6 (Dense)                 (None, 50)           5300        concatenate[0][0]                
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, 50)           0           dense_6[0][0]                    
__________________________________________________________________________________________________
dense_7 (Dense)                 (None, 25)           1275        dropout_3[0][0]                  
__________________________________________________________________________________________________
dropout_4 (Dropout)             (None, 25)           0           dense_7[0][0]                    
__________________________________________________________________________________________________
dense_8 (Dense)                 (None, 3)            78          dropout_4[0][0]                  
==================================================================================================
Total params: 223,163
Trainable params: 223,163
Non-trainable params: 0
__________________________________________________________________________________________________
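The Concatenate layer in the summary merges the two branches: the pooled title embedding contributes 100 features and the salary branch contributes 5, giving the 105-dim vector fed into the dense layers. The shape arithmetic, sketched with zero-filled placeholder batches:

```python
import numpy as np

# Shapes at the Concatenate layer (batch of 4, values are placeholders):
pooled = np.zeros((4, 100))   # GlobalMaxPool1D over the (50, 100) title embeddings
salary = np.zeros((4, 5))     # 5-unit dense layer on the scalar salary input
merged = np.concatenate([pooled, salary], axis=1)  # shape (4, 105)
```

This matches the `(None, 105)` output of `concatenate` in the model summary above.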
In [ ]:
history = model.fit(x=[X_train_title,X_train2], y=y_train,
                    epochs=20,
                    verbose=False,
                    validation_data=([X_test_title,X_test2], y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(x=[X_train_title,X_train2], y=y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(x=[X_test_title,X_test2], y=y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history_salary(history)
Training Accuracy: 0.8742
Testing Accuracy:  0.8723
In [ ]:
result_train = model.predict([X_train_title,X_train2])
pr_train = [np.argmax(x) for x in result_train]
result_test = model.predict([X_test_title,X_test2])
pr_test = [np.argmax(x) for x in result_test]
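The per-row `np.argmax` loop above can also be written as a single vectorized call; a minimal sketch with toy softmax outputs (illustration only, not the model's real predictions):

```python
import numpy as np

# Toy softmax outputs for three salary classes (illustration only).
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1],
                  [0.2, 0.3, 0.5]])

# The list comprehension used in the notebook ...
pr_loop = [np.argmax(p) for p in probs]
# ... is equivalent to one vectorized call over axis 1 (the class axis):
pr_vec = np.argmax(probs, axis=1)

print(pr_loop, pr_vec.tolist())  # [1, 0, 2] [1, 0, 2]
```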
In [ ]:
print(classification_report(y_train_true, pr_train))
print(classification_report(y_test_true, pr_test))
              precision    recall  f1-score   support

           0       0.98      0.90      0.94      3258
           1       0.65      1.00      0.79      1788
           2       1.00      0.78      0.88      3157

    accuracy                           0.87      8203
   macro avg       0.87      0.89      0.87      8203
weighted avg       0.91      0.87      0.88      8203

              precision    recall  f1-score   support

           0       0.97      0.90      0.93       834
           1       0.66      0.99      0.79       465
           2       0.99      0.77      0.87       752

    accuracy                           0.87      2051
   macro avg       0.88      0.89      0.86      2051
weighted avg       0.91      0.87      0.88      2051

Job Description + Salary - Neural Networks

In [ ]:
input_1 = layers.Input(shape=(maxlen,))
input_2 = layers.Input(shape=(1,))
embedding_layer = layers.Embedding(vocab_size, embedding_dim,
                           weights=[embedding_matrix],  
                           trainable=True)(input_1)

globalpool = layers.GlobalMaxPool1D()(embedding_layer)

dense_layer_1 = layers.Dense(5, activation='relu')(input_2)
# dense_layer_2 = Dense(10, activation='relu')(dense_layer_1)
concat_layer = layers.Concatenate()([globalpool, dense_layer_1])
dense_layer_3 = layers.Dense(50, activation='relu')(concat_layer)
dropout_1 = layers.Dropout(0.2)(dense_layer_3)
dense_layer_4 = layers.Dense(25, activation='relu')(dropout_1)
dropout_2 = layers.Dropout(0.2)(dense_layer_4)
output = layers.Dense(3, activation='softmax')(dropout_2)
model = Model(inputs=[input_1, input_2], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_3 (InputLayer)            [(None, 800)]        0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 800, 100)     4158700     input_3[0][0]                    
__________________________________________________________________________________________________
input_4 (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
global_max_pooling1d_3 (GlobalM (None, 100)          0           embedding_3[0][0]                
__________________________________________________________________________________________________
dense_9 (Dense)                 (None, 5)            10          input_4[0][0]                    
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 105)          0           global_max_pooling1d_3[0][0]     
                                                                 dense_9[0][0]                    
__________________________________________________________________________________________________
dense_10 (Dense)                (None, 50)           5300        concatenate_1[0][0]              
__________________________________________________________________________________________________
dropout_5 (Dropout)             (None, 50)           0           dense_10[0][0]                   
__________________________________________________________________________________________________
dense_11 (Dense)                (None, 25)           1275        dropout_5[0][0]                  
__________________________________________________________________________________________________
dropout_6 (Dropout)             (None, 25)           0           dense_11[0][0]                   
__________________________________________________________________________________________________
dense_12 (Dense)                (None, 3)            78          dropout_6[0][0]                  
==================================================================================================
Total params: 4,165,363
Trainable params: 4,165,363
Non-trainable params: 0
__________________________________________________________________________________________________
In [ ]:
history = model.fit(x=[X_train,X_train2], y=y_train,
                    epochs=20,
                    verbose=1,
                    validation_data=([X_test,X_test2], y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(x=[X_train,X_train2], y=y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(x=[X_test,X_test2], y=y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history_salary(history)
Epoch 1/20
821/821 [==============================] - 35s 41ms/step - loss: 1146.9949 - accuracy: 0.3653 - val_loss: 1.0712 - val_accuracy: 0.4066
Epoch 2/20
821/821 [==============================] - 34s 41ms/step - loss: 1.0711 - accuracy: 0.4057 - val_loss: 1.0469 - val_accuracy: 0.6119
Epoch 3/20
821/821 [==============================] - 33s 40ms/step - loss: 1.0127 - accuracy: 0.5179 - val_loss: 0.7803 - val_accuracy: 0.6685
Epoch 4/20
821/821 [==============================] - 33s 40ms/step - loss: 0.7869 - accuracy: 0.6693 - val_loss: 0.6853 - val_accuracy: 0.7216
Epoch 5/20
821/821 [==============================] - 33s 40ms/step - loss: 0.6722 - accuracy: 0.7174 - val_loss: 0.6441 - val_accuracy: 0.7265
Epoch 6/20
821/821 [==============================] - 33s 40ms/step - loss: 0.6532 - accuracy: 0.7233 - val_loss: 0.6324 - val_accuracy: 0.7314
Epoch 7/20
821/821 [==============================] - 33s 40ms/step - loss: 0.6008 - accuracy: 0.7392 - val_loss: 0.5891 - val_accuracy: 0.7431
Epoch 8/20
821/821 [==============================] - 33s 40ms/step - loss: 0.5675 - accuracy: 0.7598 - val_loss: 0.5939 - val_accuracy: 0.7435
Epoch 9/20
821/821 [==============================] - 33s 41ms/step - loss: 0.5467 - accuracy: 0.7670 - val_loss: 0.6037 - val_accuracy: 0.7455
Epoch 10/20
821/821 [==============================] - 33s 41ms/step - loss: 0.5128 - accuracy: 0.7821 - val_loss: 0.5746 - val_accuracy: 0.7543
Epoch 11/20
821/821 [==============================] - 33s 40ms/step - loss: 0.5165 - accuracy: 0.7825 - val_loss: 0.5827 - val_accuracy: 0.7543
Epoch 12/20
821/821 [==============================] - 33s 40ms/step - loss: 0.5051 - accuracy: 0.7918 - val_loss: 0.5769 - val_accuracy: 0.7509
Epoch 13/20
821/821 [==============================] - 33s 41ms/step - loss: 0.4884 - accuracy: 0.7977 - val_loss: 0.5878 - val_accuracy: 0.7572
Epoch 14/20
821/821 [==============================] - 34s 41ms/step - loss: 0.4535 - accuracy: 0.8142 - val_loss: 0.5826 - val_accuracy: 0.7567
Epoch 15/20
821/821 [==============================] - 33s 40ms/step - loss: 0.4644 - accuracy: 0.8140 - val_loss: 0.6396 - val_accuracy: 0.7513
Epoch 16/20
821/821 [==============================] - 33s 40ms/step - loss: 0.4454 - accuracy: 0.8150 - val_loss: 0.6033 - val_accuracy: 0.7572
Epoch 17/20
821/821 [==============================] - 33s 41ms/step - loss: 0.4358 - accuracy: 0.8229 - val_loss: 0.7108 - val_accuracy: 0.7450
Epoch 18/20
821/821 [==============================] - 33s 40ms/step - loss: 0.4223 - accuracy: 0.8228 - val_loss: 0.6391 - val_accuracy: 0.7562
Epoch 19/20
821/821 [==============================] - 33s 41ms/step - loss: 0.4328 - accuracy: 0.8198 - val_loss: 0.6254 - val_accuracy: 0.7489
Epoch 20/20
821/821 [==============================] - 33s 41ms/step - loss: 0.4180 - accuracy: 0.8253 - val_loss: 0.6633 - val_accuracy: 0.7572
Training Accuracy: 0.8497
Testing Accuracy:  0.7572
In [ ]:
result_train = model.predict([X_train,X_train2])
pr_train = [np.argmax(x) for x in result_train]
result_test = model.predict([X_test,X_test2])
pr_test = [np.argmax(x) for x in result_test]
In [ ]:
print(classification_report(y_train_true, pr_train))
print(classification_report(y_test_true, pr_test))
              precision    recall  f1-score   support

           0       0.92      0.90      0.91      3258
           1       0.63      0.91      0.74      1788
           2       0.99      0.76      0.86      3157

    accuracy                           0.85      8203
   macro avg       0.85      0.86      0.84      8203
weighted avg       0.88      0.85      0.86      8203

              precision    recall  f1-score   support

           0       0.80      0.84      0.82       834
           1       0.54      0.68      0.60       465
           2       0.92      0.72      0.81       752

    accuracy                           0.76      2051
   macro avg       0.75      0.74      0.74      2051
weighted avg       0.78      0.76      0.76      2051

Job Title - Support Vector Machine (SVM)

In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity
from sklearn.decomposition import TruncatedSVD, NMF, PCA, LatentDirichletAllocation
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans, DBSCAN
from sklearn.manifold import TSNE
from sklearn.feature_extraction import text
from sklearn.preprocessing import normalize, LabelEncoder
from nltk.stem.snowball import SnowballStemmer
import pickle
In [ ]:
# Base - SVM
text_clf_svm = Pipeline([('vect', CountVectorizer()), 
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', 
                                                   penalty='l2',
                                                   alpha=1e-3,
                                                   random_state=1))])

text_clf_svm = text_clf_svm.fit(X_train['JobTitle'], y_train)
predicted_svm = text_clf_svm.predict(X_test['JobTitle'])
np.mean(predicted_svm == y_test)

# Grid Search
# Here we define the hyperparameters to tune. Each parameter name is prefixed
# with the name of the pipeline step it applies to (e.g. vect__ngram_range
# tells the CountVectorizer to try both unigrams alone and unigrams+bigrams,
# keeping whichever scores best).
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)], 
                  'tfidf__use_idf': (True, False),
                  'clf-svm__alpha': (1e-2, 1e-3)}

gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(X_train['JobTitle'], y_train)
pickle.dump(gs_clf_svm, 
            open(root_path + '/data/jt_svm_base.sav', 'wb'))

print(gs_clf_svm.best_score_)
print(gs_clf_svm.best_params_)
0.8639516356772343
{'clf-svm__alpha': 0.001, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}
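Beyond `best_score_` and `best_params_`, `GridSearchCV` records every parameter combination's mean cross-validated score in `cv_results_`. A self-contained sketch on a toy corpus (stand-in text, not the real job titles):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Toy documents and labels standing in for the job-title data.
docs = ["data scientist python ml", "business analyst excel reports",
        "data analyst sql dashboards", "machine learning data scientist",
        "business requirements analyst", "sql reporting data analyst"] * 5
labels = ["DS", "BA", "DA", "DS", "BA", "DA"] * 5

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=1))])

grid = {'vect__ngram_range': [(1, 1), (1, 2)],
        'tfidf__use_idf': (True, False)}

gs = GridSearchCV(pipe, grid, cv=3, n_jobs=-1)
gs.fit(docs, labels)

# Every combination's mean CV score, not just the winner.
for params, score in zip(gs.cv_results_['params'],
                         gs.cv_results_['mean_test_score']):
    print(params, round(score, 3))
```

Inspecting the full grid this way helps judge how sensitive the score is to each parameter, rather than trusting the single best combination.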
In [ ]:
# NLTK - SVM
# Removing stop words
text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words = 'english')), 
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', 
                                                   penalty='l2',
                                                   alpha=1e-3,
                                                   random_state=1))])

text_clf_svm = text_clf_svm.fit(X_train['JobTitle'], y_train)
predicted_svm = text_clf_svm.predict(X_test['JobTitle'])
np.mean(predicted_svm == y_test)

parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)], 
                  'tfidf__use_idf': (True, False),
                  'clf-svm__alpha': (1e-2, 1e-3)}

gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(X_train['JobTitle'], y_train)
pickle.dump(gs_clf_svm, 
            open(root_path + '/data/jt_svm_stopwords.sav', 'wb'))

print(gs_clf_svm.best_score_)
print(gs_clf_svm.best_params_)
0.864439291924912
{'clf-svm__alpha': 0.001, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}
In [ ]:
# Define a CountVectorizer subclass that stems every token with the Snowball
# stemmer before counting. This cell must be run even when the models are
# loaded from disk, since unpickling the saved pipelines requires the class
# definition.
class StemmedCountVectorizer(CountVectorizer):
  def build_analyzer(self):
    analyzer = super(StemmedCountVectorizer, self).build_analyzer()
    return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])

stemmer = SnowballStemmer("english", ignore_stopwords=True)
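As a quick sanity check of what the stemming analyzer does to tokens, here is a standalone sketch (it re-declares the class so it runs on its own, and uses the default stemmer, since `ignore_stopwords=True` additionally requires the NLTK stopwords corpus to be downloaded):

```python
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = SnowballStemmer("english")

class StemmedCountVectorizer(CountVectorizer):
  def build_analyzer(self):
    # Wrap the default analyzer (lowercasing, tokenizing, stop-word
    # removal) and stem each surviving token.
    analyzer = super().build_analyzer()
    return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

analyzer = StemmedCountVectorizer(stop_words='english').build_analyzer()
print(analyzer("running runs reporting"))  # ['run', 'run', 'report']
```

Collapsing inflected forms like "running"/"runs" onto one stem shrinks the vocabulary, which is why the stemmed models below have fewer features than the base ones.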
In [ ]:
# Stemming Code - SVM
stemmer = SnowballStemmer("english", ignore_stopwords=True)

stemmed_count_vect = StemmedCountVectorizer(stop_words='english')

text_svm_stemmed = Pipeline([('vect', stemmed_count_vect), 
                             ('tfidf', TfidfTransformer()), 
                             ('clf-svm', SGDClassifier(loss='hinge', 
                                                   penalty='l2',
                                                   alpha=1e-3,
                                                   random_state=1))])

text_svm_stemmed = text_svm_stemmed.fit(X_train['JobTitle'], y_train)
predicted_svm_stemmed = text_svm_stemmed.predict(X_test['JobTitle'])
np.mean(predicted_svm_stemmed == y_test)

parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)], 
                  'tfidf__use_idf': (True, False),
                  'clf-svm__alpha': (1e-2, 1e-3)}

gs_clf_svm = GridSearchCV(text_svm_stemmed, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(X_train['JobTitle'], y_train)
pickle.dump(gs_clf_svm, 
            open(root_path + '/data/jt_svm_stemmed.sav', 'wb'))

print(gs_clf_svm.best_score_)
print(gs_clf_svm.best_params_)
0.863463756484
{'clf-svm__alpha': 0.001, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}
In [ ]:
# loading models
jt_svm_base = pickle.load(open(root_path + '/data/jt_svm_base.sav', 'rb'))
jt_svm_stopwords = pickle.load(open(root_path + '/data/jt_svm_stopwords.sav', 'rb'))
jt_svm_stemmed = pickle.load(open(root_path + '/data/jt_svm_stemmed.sav', 'rb'))
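A pickle round-trip like the dump/load pattern above can be sanity-checked in memory before relying on the files on disk; a minimal sketch with a tiny fitted model (toy data, not the saved grid searches):

```python
import io
import pickle
import numpy as np
from sklearn.linear_model import SGDClassifier

# Fit a tiny classifier on toy data.
X = np.array([[0.0], [1.0], [0.1], [0.9]])
y = np.array([0, 1, 0, 1])
clf = SGDClassifier(loss='hinge', random_state=1).fit(X, y)

# Serialize and restore without touching disk.
buf = io.BytesIO()
pickle.dump(clf, buf)
buf.seek(0)
restored = pickle.load(buf)

# The restored model predicts identically to the original.
print((restored.predict(X) == clf.predict(X)).all())
```

Note that pickled scikit-learn models are generally only safe to reload under the same library version that produced them.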
In [ ]:
# base SVM classification report on the testing data
print(classification_report(y_test, jt_svm_base.predict(X_test['JobTitle'])))
print(confusion_matrix(y_test, jt_svm_base.predict(X_test['JobTitle'])))
                  precision    recall  f1-score   support

Business Analyst       0.97      0.91      0.94       822
    Data Analyst       0.67      0.95      0.79       470
  Data Scientist       0.97      0.78      0.86       759

        accuracy                           0.87      2051
       macro avg       0.87      0.88      0.86      2051
    weighted avg       0.90      0.87      0.88      2051

[[747  71   4]
 [  5 448  17]
 [ 21 145 593]]
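In these `confusion_matrix` outputs, rows are true labels and columns are predictions, both in sorted label order (Business Analyst, Data Analyst, Data Scientist); a toy illustration:

```python
from sklearn.metrics import confusion_matrix

y_true = ['Business Analyst', 'Business Analyst', 'Data Analyst', 'Data Scientist']
y_pred = ['Business Analyst', 'Data Analyst',     'Data Analyst', 'Data Scientist']

# Row i = true class i, column j = predicted class j (sorted label order).
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1 0]
#  [0 1 0]
#  [0 0 1]]
```

Read this way, the large off-diagonal entry above (145) means many true Data Scientist listings are being predicted as Data Analyst, which is what drags Data Analyst precision down to 0.67.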
In [ ]:
# stemmed SVM classification report on the testing data
print(classification_report(y_test, jt_svm_stemmed.predict(X_test['JobTitle'])))
print(confusion_matrix(y_test, jt_svm_stemmed.predict(X_test['JobTitle'])))
                  precision    recall  f1-score   support

Business Analyst       0.97      0.91      0.94       822
    Data Analyst       0.67      0.95      0.79       470
  Data Scientist       0.96      0.78      0.86       759

        accuracy                           0.87      2051
       macro avg       0.87      0.88      0.86      2051
    weighted avg       0.90      0.87      0.88      2051

[[746  72   4]
 [  5 447  18]
 [ 19 146 594]]
In [ ]:
# removed stopwords SVM classification report on the testing data
print(classification_report(y_test, jt_svm_stopwords.predict(X_test['JobTitle'])))
print(confusion_matrix(y_test, jt_svm_stopwords.predict(X_test['JobTitle'])))
                  precision    recall  f1-score   support

Business Analyst       0.97      0.91      0.94       822
    Data Analyst       0.67      0.95      0.79       470
  Data Scientist       0.97      0.78      0.86       759

        accuracy                           0.87      2051
       macro avg       0.87      0.88      0.86      2051
    weighted avg       0.90      0.87      0.88      2051

[[747  72   3]
 [  5 448  17]
 [ 21 145 593]]

Job Description - Support Vector Machine (SVM)

In [ ]:
# Base - SVM
# we first define a pipeline that takes as input the raw JobDescription text and
# applies a CountVectorizer, a TF-IDF transformation, and an SVM, sequentially.
text_clf_svm = Pipeline([('vect', CountVectorizer()), 
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss = 'hinge', 
                                                   penalty = 'l2',
                                                   alpha = 1e-3, 
                                                   random_state = 1))])

# the pipeline is fitted on the train data and an untuned accuracy is printed.
text_clf_svm = text_clf_svm.fit(X_train['JobDescription'], y_train)
predicted_svm = text_clf_svm.predict(X_test['JobDescription'])
print(np.mean(predicted_svm == y_test))

# we then define a grid of parameters for the CountVectorizer, TF-IDF, and SVM 
# to perform a grid search for hyperparameter tuning and calculate the best
# results based on the training dataset
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)], 
                  'tfidf__use_idf': (True, False),
                  'clf-svm__alpha': (1e-2, 1e-3)}

gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs = -1)
gs_clf_svm = gs_clf_svm.fit(X_train['JobDescription'], y_train)

# we pickle the model to avoid having to run it again
pickle.dump(gs_clf_svm, 
            open(root_path + '/data/jd_svm_base.sav', 'wb'))

# these are the best scores and parameters that resulted from the grid search
print(gs_clf_svm.best_score_)
print(gs_clf_svm.best_params_)
0.7191613846903949
0.7252227226111384
{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)}
In [ ]:
# NLTK - SVM
# Removing stop words: we run the exact same pipeline, but this time
# the common English stop words are removed.
text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words = 'english')), 
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', 
                                                   penalty='l2',
                                                   alpha=1e-3, 
                                                   random_state=1))])

text_clf_svm = text_clf_svm.fit(X_train['JobDescription'], y_train)
predicted_svm = text_clf_svm.predict(X_test['JobDescription'])
print(np.mean(predicted_svm == y_test))

parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)], 
                  'tfidf__use_idf': (True, False),
                  'clf-svm__alpha': (1e-2, 1e-3)}

gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(X_train['JobDescription'], y_train)
pickle.dump(gs_clf_svm, 
            open(root_path + '/data/jd_svm_stopwords.sav', 'wb'))

print(gs_clf_svm.best_score_)
print(gs_clf_svm.best_params_)
0.719648951730863
0.7471653215618079
{'clf-svm__alpha': 0.001, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}
In [ ]:
# Stemming Code - SVM
# in this final model, we remove stopwords and we stem the words to their roots
# using the previously defined class
stemmer = SnowballStemmer("english", ignore_stopwords=True)
    
stemmed_count_vect = StemmedCountVectorizer(stop_words='english')

text_svm_stemmed = Pipeline([('vect', stemmed_count_vect), 
                             ('tfidf', TfidfTransformer()), 
                             ('clf-svm', SGDClassifier(loss='hinge', 
                                                   penalty='l2',
                                                   alpha=1e-3, 
                                                   random_state=1))])

text_svm_stemmed = text_svm_stemmed.fit(X_train['JobDescription'], y_train)
predicted_svm_stemmed = text_svm_stemmed.predict(X_test['JobDescription'])
print(np.mean(predicted_svm_stemmed == y_test))

parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)], 
                  'tfidf__use_idf': (True, False),
                  'clf-svm__alpha': (1e-2, 1e-3)}

gs_clf_svm = GridSearchCV(text_svm_stemmed, parameters_svm, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(X_train['JobDescription'], y_train)
pickle.dump(gs_clf_svm, 
            open(root_path + '/data/jd_svm_stemmed.sav', 'wb'))

print(gs_clf_svm.best_score_)
print(gs_clf_svm.best_params_)
0.7240370550950755
0.7503344926502281
{'clf-svm__alpha': 0.001, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}
In [ ]:
# loading models
jd_svm_base = pickle.load(open(root_path + '/data/jd_svm_base.sav', 'rb'))
jd_svm_stopwords = pickle.load(open(root_path + '/data/jd_svm_stopwords.sav', 'rb'))
jd_svm_stemmed = pickle.load(open(root_path + '/data/jd_svm_stemmed.sav', 'rb'))
In [ ]:
# base SVM classification report on the testing data
print(classification_report(y_test, jd_svm_base.predict(X_test['JobDescription'])))
print(confusion_matrix(y_test, jd_svm_base.predict(X_test['JobDescription'])))
                  precision    recall  f1-score   support

Business Analyst       0.73      0.90      0.80       822
    Data Analyst       0.63      0.23      0.33       470
  Data Scientist       0.73      0.83      0.77       759

        accuracy                           0.72      2051
       macro avg       0.69      0.65      0.64      2051
    weighted avg       0.70      0.72      0.69      2051

[[741  23  58]
 [188 107 175]
 [ 91  41 627]]
In [ ]:
# stemmed SVM classification report on the testing data
print(classification_report(y_test, jd_svm_stemmed.predict(X_test['JobDescription'])))
print(confusion_matrix(y_test, jd_svm_stemmed.predict(X_test['JobDescription'])))
                  precision    recall  f1-score   support

Business Analyst       0.76      0.88      0.82       822
    Data Analyst       0.62      0.36      0.46       470
  Data Scientist       0.75      0.82      0.78       759

        accuracy                           0.74      2051
       macro avg       0.71      0.69      0.69      2051
    weighted avg       0.73      0.74      0.72      2051

[[722  41  59]
 [151 171 148]
 [ 75  62 622]]
In [ ]:
# removed stopwords SVM classification report on the testing data
print(classification_report(y_test, jd_svm_stopwords.predict(X_test['JobDescription'])))
print(confusion_matrix(y_test, jd_svm_stopwords.predict(X_test['JobDescription'])))
                  precision    recall  f1-score   support

Business Analyst       0.76      0.88      0.82       822
    Data Analyst       0.62      0.31      0.42       470
  Data Scientist       0.74      0.83      0.78       759

        accuracy                           0.73      2051
       macro avg       0.70      0.68      0.67      2051
    weighted avg       0.72      0.73      0.71      2051

[[726  37  59]
 [156 148 166]
 [ 76  54 629]]